November 13, 2019

Colleen Murphy

Shanghai Open Infrastructure Forum and PTG

The Open Infrastructure Summit, Forum, and Project Teams Gathering was held last week in the beautiful city of Shanghai. The event was held in the spirit of cross-cultural collaboration and attendees arrived with the intention of bridging the gap with a usually faraway but significant part of the OpenStack community …

by Colleen Murphy at November 13, 2019 01:00 AM

Sean McGinnis

November 2019 OpenStack Board Notes

The Open Infrastructure Summit was held in mainland China for the first time the week of November 4th, 2019, in Shanghai. As usual, we took advantage of the opportunity of having so many members in one place by having a Board of Directors meeting on Sunday, November 3.

Attendance was a little lighter due to visa challenges, travel budgets, and other issues. But we still had a quorum with a lot of folks in the room, and I’m sure it was a nice change for our Chinese board members and others from the APAC region.

The original meeting agenda is published on the wiki as usual.

OSF Updates

Following the usual pattern, Jonathan Bryce kicked things off with an update of Foundation and project activity.

One interesting thing that really stood out to me, which Jonathan also shared the next day in the opening keynotes, was an analyst report putting OpenStack’s market at $7.7 billion in 2020. I am waiting for those slides to be published, but I think this really showed that despite the decrease in investment by companies in the development of OpenStack, its adoption is stable and continuing to grow.

This was especially highlighted in China, with companies like China UnionPay, China Mobile, and other large companies from a variety of industries increasing their use of OpenStack, and with public clouds like Huawei and other local service providers basing their services on top of OpenStack.

I can definitely state from experience after that week that access to the typical big three public cloud providers in the US is a challenge through the Great Firewall. Being able to base your services on top of a borderless open source platform like OpenStack is a great option given the current political pressures. A community-based solution, rather than a foreign tech company’s offering, probably makes a lot of sense and is helping drive this adoption.

Of course, telecom adoption is still growing as well. I’m not as involved in that space, but it really seems like OpenStack is becoming the de facto standard for programmable infrastructure to base dynamic NFV solutions on top of, both directly with VMs and bare metal, and as a locally controlled platform to serve as the underlying infrastructure for Kubernetes.

Updates and Community Reports

StarlingX Progress Report

The StarlingX project has made a lot of progress over the last several months. They are getting closer and closer to the latest OpenStack code. They have been actively working on getting their custom changes merged upstream so they do not need to continue maintaining a fork. So far, they have been able to get a lot of changes into various projects. They hope to eventually be able to just deploy standard OpenStack services configured to meet their needs, focusing instead on the services on top of OpenStack that make StarlingX attractive and a great solution for edge infrastructure.

Indian Community Update

Prakash Ramchandran gave an update on the various meetups and events being organized across India. This is a large market for OpenStack. Recently approved government initiatives could make this an ideal time to help nurture the Indian OpenStack community.

I’m glad to see all of the activity that Prakash has been helping support there. This is another region where I expect to see a lot of growth in OpenStack adoption.

Interop Working Group

Egle gave an update on the Interop WG activity, and the second 2019 set of changes was approved. Nothing too exciting there, with just minor updates to the interop requirements.

The larger discussion was about the need for and the health of the Interop WG. Chris Hoge was a very active contributor to this, but he recently left the OSF, and the OpenStack community, to pursue a different opportunity. Egle Sigler is really the only one left on the team, and she has shared that she would not be able to do much more with the group other than keeping the lights on.

This team is responsible for the guidelines that must be followed for someone to certify that their service or distribution of OpenStack meets the minimum functionality requirements to be consistent with other OpenStack deployments. This certification is needed to be able to use the OpenStack logo and be called “OpenStack Powered”.

I think there was pretty unanimous agreement that this kind of thing is still very important. Users need to be able to have a consistent user experience when moving between OpenStack-based clouds. Inconsistency would lead to unexpected behaviors or responses and a poor user experience.

For now it is a call for help and to raise awareness. It did make me think about how we’ve been able to decentralize some efforts within the community, like moving documentation into each team’s repos rather than having a centralized docs team and docs repo. I wonder if we can put some of this work on the teams themselves to mark certain API calls as “core”, with some testing in place to ensure none of these APIs are changed or start producing different results. Something to think about at least.

First Contact SIG Update

The First Contact SIG works on things to make getting involved in the community easier. They’ve done a lot of work in the past on training and contributor documentation. They’ve recently added a Contributing Organization Guide that is targeted at the organization management level to help them understand how they can make an impact and help their employees to be involved and productive.

That’s an issue we’ve had to varying degrees in the past. Companies have had good intentions of getting involved, but they are not always sure where to start. Or they task a few employees to contribute without a good plan on how or where to do so. I think it will be good having a place to direct these companies to, to help them understand how to work with OpenStack and an open source community.

Troila Gold Member Application

Troila is an IT services company in China that provides a cloud product based on OpenStack to their customers. They have been using OpenStack for some time and saw the value in becoming an OSF Gold level sponsor.

As part of the Member Committee, Rob Esker and I met with them the week prior to go over their application and answer any questions and give feedback. That preview was pretty good, and Rob and I only had minor suggestions for them to help highlight what they have been doing with OpenStack and what their future plans were.

They had taken these suggestions and made updates to their presentation, and I think they did a very nice job explaining their goals. There was some discussion and additional questions from the board, but after a quick executive session, we voted and approved Troila as the latest Gold member of the OpenStack Foundation.

Combined Leadership Meeting

The second half of the day was a joint session with the Board and the Technical Committees or Technical Steering Committees of the OpenStack, StarlingX, Airship, Kata, and Zuul projects. Each team gave a community update for their respective areas.

My biggest takeaway from this was that although we are under-resourced in some areas, we really do have a large and very active community of people who really care about the things they are working on. Seeing growing adoption for things like Kata Containers and Zuul is really exciting.

Next Meeting

The next meeting will be a conference call on December 10th. No word yet on the agenda for that, but I wouldn’t expect too much being so soon after Shanghai. I expect there will probably be some buzz about the annual elections coming up.

Once available, the agenda will be published to the usual spot.

I was only able to finish out my term because the rest of the board voted to allow an exception to the two-seat-per-company limit after I rejoined Dell halfway through the year. That exception won’t apply for the next election, so if the three of us from Dell all hope to continue, one of us isn’t going to be able to.

I’ve waffled on this a little, but at least right now, I do think I am going to run for election again. Prakash has been doing some great work with his participation in the India OpenStack community, so I will not feel too bad if I lose out to him. I do think I’ve been more integrated in the overall development community, so since an Individual Director is supposed to be a representative for the community, I do hope I can continue. That will be up to the broader community, so I am not going to worry about it. The community will be able to elect those they support, so no matter what it will be good.

by Sean McGinnis at November 13, 2019 12:00 AM

November 12, 2019

Sean McGinnis

Why is the Cinder mascot a horse?!

I have to admit, I have to laugh to myself every time I see the Cinder mascot in a keynote presentation.

Cinder horse mascot

History (or, why the hell is that the Cinder mascot!)

The reason at least a few of us find it so funny is that it’s a bit of an inside joke.

Way back in the early days of Cinder, someone from SolidFire came up with a great looking cinder block logo for the project. It was along the style of the OpenStack logo at the time and was nice and recognizable.

Cinder logo

Then around 2016, they decided it was time to refresh the OpenStack logo and make it look more modern and flat. Our old logo no longer matched the overall project, but we still loved it.

I did make an attempt to update it. I made a stylized version of the Cinder block logo using the new OpenStack logo as a basis for it. I really wish I could find it now, but I may have lost the image when I switched jobs. You may still see it on someone’s laptop - I had a very small batch of stickers made while I was still Cinder PTL.

It was soon after the OpenStack logo change that the Foundation decided to introduce mascots for each project. They asked each team to think of an animal that they could identify with. It was supposed to be a fun exercise for the teams to be able to pick their own kind of logo, with graphic designers coming up with very high quality images.

The Cinder team didn’t really have an obvious animal. At least not as obvious as a cinder block had been. It was during one of our midcycle meetups in Ft. Collins, CO, while we were brainstorming, that we arrived at our horse.

Trying to think of something that would actually represent the team, we were talking over what Cinder actually was. We were mostly all from different storage vendors. We refer to the different storage devices that are used with Cinder as backends.

Backends are also what some call butts. Butts… asses. Donkeys are also called asses. Donkey!

One or two people on the team had cultural objections to having a donkey as a mascot. They didn’t think it was a good representation of our project. So we compromised and went with a horse.

So we asked for a horse to be our mascot. The initial design they came up with was a Ferrari-looking stallion. Way too sporty and fierce for our team. Even though the OpenStack Foundation had actually published it and even created some stickers, we explained our, erm… thought process… behind coming up with the horse in the first place. The design team was great, and went back to the drawing board. The result is the back-end view of the horse that we have today. They even worked a little ‘C’ into the swish of the horse’s tail.

So that’s the story behind the Cinder logo. It’s just because we’re all a bunch of backends.

by Sean McGinnis at November 12, 2019 12:00 AM

November 11, 2019

RDO

Community Blog Round Up 11 November 2019

As we dive into the Ussuri development cycle, I’m sad to report that there’s not a lot of writing happening upstream.

If you’re one of those people waiting for a call to action, THIS IS IT! We want to hear about your story, your problem, your accomplishment, your analogy, your fight, your win, your loss – all of it.

And, in the meantime, Adam Young says it’s not that cloud is difficult, it’s networking! Fierce words, Adam. And a super fierce article to boot.

Deleting Trunks in OpenStack before Deleting Ports by Adam Young

Cloud is easy. It is networking that is hard.

Read more at https://adam.younglogic.com/2019/11/deleting-trunks-before-ports/

by Rain Leander at November 11, 2019 01:46 PM

November 09, 2019

Aptira

OSN-Day

Aptira OSN Day

The Open Networking technology landscape has evolved quickly over the last two years. How can Telcos keep up?

Our team of network experts has used Software Defined Networking techniques for many different use cases, including Traffic Engineering, Segment Routing, Integration and Automated Traffic Engineering, and many more, addressing many of the key challenges associated with networks, including security, volume and flexibility concerns, to provide customers with an uninterrupted user experience.

At OSN Day, we will be helping attendees to learn about the risks associated with 5G networks. Edge Compute is needed for 5G and 5G-enabled use cases, but currently those use cases are ill-defined and incremental revenue is uncertain. Therefore, it’s not clear what is actually required, and the Edge business case is risky. We’ll be on site explaining how to mitigate these risks, ensuring successful network functionality through the implementation of a risk-optimised approach to 5G. You can download the full whitepaper here.

We will also have our amazingly talented Network Consultant Farzaneh Pakzad presenting in The Programmable Network breakout track. Farzaneh will be comparing, rating and evaluating the most popular Open Source SDN controllers in use today. This comparison will be useful for organisations selecting the SDN controller that best matches their network design and requirements.

Farzaneh has a PhD in Software Defined Networks from the University of Queensland. Her research interests include Software Defined Networks, Cloud Computing and Network Security. During her career, Farzaneh has provided advisory services for transport SDN solutions and implemented Software Defined Networking Wide Area Network functionalities for some of Australia’s largest Telcos.

We’ve got some great swag to give away and will also be running a demonstration of Tungsten Fabric as a Kubernetes CNI, so if you’re at OSN Day make sure you check out Farzaneh’s session in Breakout room 2 and also visit the team of Aptira Solutionauts in the expo room. They can help you create, design and deploy the network of tomorrow.

Ready to move your network into the software defined future?
Automate your network with ONAP.

Find Out How

The post OSN-Day appeared first on Aptira.

by Jessica Field at November 09, 2019 12:53 PM

November 07, 2019

Adam Young

Deleting Trunks in OpenStack before Deleting Ports

Cloud is easy. It is networking that is hard.

Red Hat supports installing OpenShift on OpenStack. As a Cloud SA, I need to be able to demonstrate this, and make it work for customers. As I was playing around with it, I found I could not tear down clusters due to a dependency issue with ports.


When building and tearing down network structures with Ansible, I had learned the hard way that there were dependencies. Routers came down before subnets, and so on. But the latest round had me scratching my head. I could not get ports to delete, and the error message was no help.

I was able to figure out that the ports linked to security groups. In fact, I could unset almost all of the dependencies using the port set command line. For example:

openstack port set openshift-q5nqj-master-port-1  --no-security-group --no-allowed-address --no-tag --no-fixed-ip

However, I still could not delete the ports. I did notice that there was a trunk_details section at the bottom of the port show output:

trunk_details         | {'trunk_id': 'dd1609af-4a90-4a9e-9ea4-5f89c63fb9ce', 'sub_ports': []} 

But there is no way to “unset” that. It turns out I had it backwards: you need to delete the trunk first. A message from Kristi Nikolla:

the port is set as the parent for a “trunk” so you need to delete the trunk firs

Kristi In IRC
curl -H "x-auth-token: $TOKEN" https://kaizen.massopen.cloud:13696/v2.0/trunks/

It turns out that you can do this with the CLI…at least I could.

$ openstack network trunk show 01a19e41-49c6-467c-a726-404ffedccfbb
+-----------------+----------------------------------------+
| Field           | Value                                  |
+-----------------+----------------------------------------+
| admin_state_up  | UP                                     |
| created_at      | 2019-11-04T02:58:08Z                   |
| description     |                                        |
| id              | 01a19e41-49c6-467c-a726-404ffedccfbb   |
| name            | openshift-zq7wj-master-trunk-1         |
| port_id         | 6f4d1ecc-934b-4d29-9fdd-077ffd48b7d8   |
| project_id      | b9f1401936314975974153d78b78b933       |
| revision_number | 3                                      |
| status          | DOWN                                   |
| sub_ports       |                                        |
| tags            | ['openshiftClusterID=openshift-zq7wj'] |
| tenant_id       | b9f1401936314975974153d78b78b933       |
| updated_at      | 2019-11-04T03:09:49Z                   |
+-----------------+----------------------------------------+

Here is the script I used to delete them. Notice that the status was DOWN for all of the ports I wanted gone.

for PORT in $( openstack port list | awk '/DOWN/ {print $2}' ); do
    TRUNK_ID=$( openstack port show $PORT -f json | jq -r '.trunk_details | .trunk_id' )
    echo port $PORT has trunk $TRUNK_ID
    openstack network trunk delete $TRUNK_ID
done

Kristi had used the curl command because he did not have the network trunk option in his CLI. Turns out he needed to install python-neutronclient first.
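If you hit the same missing commands, note that the openstack network trunk subcommands come from the neutron plugin for the OpenStack client; with a pip-based client install, getting them is typically just:

pip install python-neutronclient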

by Adam Young at November 07, 2019 07:27 PM

November 06, 2019

StackHPC Team Blog

Worlds Collide: Virtual Machines & Bare Metal in OpenStack

Ironic's mascot, Pixie Boots

To virtualise or not to virtualise?

If performance is what you need, then there's no debate - bare metal still beats virtual machines; particularly for I/O intensive applications. However, unless you can guarantee to keep it fully utilised, iron comes at a price. In this article we describe how Nova can be used to provide access to both hypervisors and bare metal compute nodes in a unified manner.

Scheduling

When support for bare metal compute via Ironic was first introduced to Nova, it could not easily coexist with traditional hypervisor-based workloads. Reported workarounds typically involved the use of host aggregates and flavor properties.

Scheduling of bare metal is covered in detail in our bespoke bare metal blog article (see Recap: Scheduling in Nova).

Since the Placement service was introduced, scheduling has significantly changed for bare metal. The standard vCPU, memory and disk resources were replaced with a single unit of a custom resource class for each Ironic node. There are two key side-effects of this:

  • a bare metal node is either entirely allocated or not at all
  • the resource classes used by virtual machines and bare metal are disjoint, so we cannot end up with a VM flavor being scheduled to a bare metal node

A flavor for a 'tiny' VM might look like this:

openstack flavor show vm-tiny -f json -c name -c vcpus -c ram -c disk -c properties
{
  "name": "vm-tiny",
  "vcpus": 1,
  "ram": 1024,
  "disk": 1,
  "properties": ""
}

A bare metal flavor for 'gold' nodes could look like this:

openstack flavor show bare-metal-gold -f json -c name -c vcpus -c ram -c disk -c properties
{
  "name": "bare-metal-gold",
  "vcpus": 64,
  "ram": 131072,
  "disk": 371,
  "properties": "resources:CUSTOM_GOLD='1',
                 resources:DISK_GB='0',
                 resources:MEMORY_MB='0',
                 resources:VCPU='0'"
}

Note that the vCPU/RAM/disk resources are informational only, and are zeroed out via properties for scheduling purposes. We will discuss this further later on.
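As a concrete illustration, here is a rough sketch of how such a flavor and a matching Ironic node might be set up; the flavor name, the sizes and the 'GOLD' resource class are illustrative rather than prescriptive:

openstack flavor create --vcpus 64 --ram 131072 --disk 371 bare-metal-gold
openstack flavor set bare-metal-gold \
  --property resources:CUSTOM_GOLD=1 \
  --property resources:VCPU=0 \
  --property resources:MEMORY_MB=0 \
  --property resources:DISK_GB=0
# An Ironic node with resource_class 'GOLD' reports CUSTOM_GOLD inventory to Placement
openstack baremetal node set <node UUID> --resource-class GOLD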

With flavors in place, users choosing between VMs and bare metal is handled by picking the correct flavor.

What about networking?

In our mixed environment, we might want our VMs and bare metal instances to be able to communicate with each other, or we might want them to be isolated from each other. Both models are possible, and work in the same way as a typical cloud - Neutron networks are isolated from each other until connected via a Neutron router.

Bare metal compute nodes typically use VLAN or flat networking, although with the right combination of network hardware and Neutron plugins other models may be possible. With VLAN networking, assuming that hypervisors are connected to the same physical network as bare metal compute nodes, then attaching a VM to the same VLAN as a bare metal compute instance will provide L2 connectivity between them. Alternatively, it should be possible to use a Neutron router to join up bare metal instances on a VLAN with VMs on another network e.g. VXLAN.

What does this look like in practice? We need a combination of Neutron plugins/drivers that support both VM and bare metal networking. To connect bare metal servers to tenant networks, it is necessary for Neutron to configure physical network devices. We typically use the networking-generic-switch ML2 mechanism driver for this, although the networking-ansible driver is emerging as a promising vendor-neutral alternative. These drivers support bare metal ports, that is Neutron ports with a VNIC_TYPE of baremetal. Vendor-specific drivers are also available, and may support both VMs and bare metal.
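To give a flavour of the configuration involved, a minimal (and hedged) ML2 sketch enabling networking-generic-switch alongside Open vSwitch might look like the following; the switch name, device type and credentials are placeholders, and the exact options depend on your hardware and driver version:

[ml2]
mechanism_drivers = openvswitch,genericswitch

# One section per managed switch; these credentials are used to configure
# VLANs on the switch ports that bare metal nodes are cabled to.
[genericswitch:leaf-switch-1]
device_type = netmiko_arista_eos
ip = 192.0.2.10
username = neutron
password = secret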

Where's the catch?

One issue that more mature clouds may encounter is around the transition from scheduling based on standard resource classes (vCPU, RAM, disk), to scheduling based on custom resource classes. If old bare metal instances exist that were created in the Rocky release or earlier, they may have standard resource class inventory in Placement, in addition to their custom resource class. For example, here is the inventory reported to Placement for such a node:

$ openstack resource provider inventory list <node UUID>
+----------------+------------------+----------+----------+-----------+----------+--------+
| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit |  total |
+----------------+------------------+----------+----------+-----------+----------+--------+
| VCPU           |              1.0 |       64 |        0 |         1 |        1 |     64 |
| MEMORY_MB      |              1.0 |   131072 |        0 |         1 |        1 | 131072 |
| DISK_GB        |              1.0 |      371 |        0 |         1 |        1 |    371 |
| CUSTOM_GOLD    |              1.0 |        1 |        0 |         1 |        1 |      1 |
+----------------+------------------+----------+----------+-----------+----------+--------+

If this node is allocated to an instance whose flavor requested (or did not explicitly zero out) standard resource classes, we will have a usage like this:

$ openstack resource provider usage show <node UUID>
+----------------+--------+
| resource_class |  usage |
+----------------+--------+
| VCPU           |     64 |
| MEMORY_MB      | 131072 |
| DISK_GB        |    371 |
| CUSTOM_GOLD    |      1 |
+----------------+--------+

If this instance is deleted, the standard resource class inventory will become available, and may be selected by the scheduler for a VM. This is not likely to end well. What we must do is ensure that these resources are not reported to Placement. This is done by default in the Stein release of Nova, and Rocky may be configured to do the same by setting the following in nova.conf:

[workarounds]
report_ironic_standard_resource_class_inventory = False

However, if we do that, then Nova will attempt to remove inventory from Placement resource providers that is already consumed by our instance, and will receive an HTTP 409 Conflict. This will quickly fill our logs with unhelpful noise.

Flavor migration

Thankfully, there is a solution. We can modify the embedded flavor in our existing instances to remove the standard resource class inventory, which will result in the removal of the allocation of these resources from Placement. This will allow Nova to remove the inventory from the resource provider. There is a Nova patch started by Matt Riedemann which will remove our standard resource class inventory. The patch needs pushing over the line, but works well enough to be cherry-picked to Rocky.

The migration can be done offline or online. We chose to do it offline, to avoid the need to deploy this patch. For each node to be migrated:

nova-manage db ironic_flavor_migration --resource_class <node resource class> --host <host> --node <node UUID>

Alternatively, if all nodes have the same resource class:

nova-manage db ironic_flavor_migration --resource_class <node resource class> --all

You can check the instance embedded flavors have been updated correctly via the database:

sql> use nova
sql> select flavor from instance_extra;

Now (Rocky only), standard resource class inventory reporting can be disabled. After the nova compute service has been running for a while, Placement will be updated:

$ openstack resource provider inventory list <node UUID>
+----------------+------------------+----------+----------+-----------+----------+-------+
| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total |
+----------------+------------------+----------+----------+-----------+----------+-------+
| CUSTOM_GOLD    |              1.0 |        1 |        0 |         1 |        1 |     1 |
+----------------+------------------+----------+----------+-----------+----------+-------+

$ openstack resource provider usage show <node UUID>
+----------------+--------+
| resource_class |  usage |
+----------------+--------+
| CUSTOM_GOLD    |      1 |
+----------------+--------+

Summary

We hope this shows that OpenStack is now in a place where VMs and bare metal can coexist peacefully, and that even for those pesky pets, there is a path forward to this brave new world. Thanks to the Nova team for working hard to make Ironic a first class citizen.

by Mark Goddard at November 06, 2019 02:00 AM

November 04, 2019

Dan Smith

Start and Monitor Image Pre-cache Operations in Nova

When you boot an instance in Nova, you provide a reference to an image. In many cases, once Nova has selected a host, the virt driver on that node downloads the image from Glance and uses it as the basis for the root disk of your instance. If your nodes are using a virt driver that supports image caching, then that image only needs to be downloaded once per node, which means the first instance to use that image causes it to be downloaded (and thus has to wait). Subsequent instances based on that image will boot much faster as the image is already resident.

If you manage an application that involves booting a lot of instances from the same image, you know that the time-to-boot for those instances could be vastly reduced if the image is already resident on the compute nodes you will land on. If you are trying to avoid the latency of rolling out a new image, this becomes a critical calculation. For years, people have asked for or proposed solutions in Nova for allowing some sort of image pre-caching to solve this, but those discussions have always become stalled in detail hell. Some people have resorted to hacks like booting host-targeted tiny instances ahead of time, direct injection of image files to Nova’s cache directory, or local code modifications. Starting in the Ussuri release, such hacks will no longer be necessary.

Image pre-caching in Ussuri

Nova’s now-merged image caching feature includes a very lightweight and no-promises way to request that an image be cached on a group of hosts (defined by a host aggregate). In order to avoid some of the roadblocks to success that have plagued previous attempts, the new API does not attempt to provide a rich status result, nor a way to poll for or check on the status of a caching operation. There is also no scheduling, persistence, or reporting of which images are cached where. Asking Nova to cache one or more images on a group of hosts is similar to asking those hosts to boot an instance there, but without the overhead that goes along with it. That means that images cached as part of such a request will be subject to the same expiry timer as any other. If you want them to remain resident on the nodes permanently, you must re-request the images before the expiry timer would have purged them. Each time an image is pre-cached on a host, the timestamp for purge is updated if the image is already resident.

Obviously for a large cloud, status and monitoring of the cache process in some way is required, especially if you are waiting for it to complete before starting a rollout. The subject of this post is to demonstrate how this can be done with notifications.

Example setup

Before we can talk about how to kick off and monitor a caching operation, we need to set up the basic elements of a deployment. That means we need some compute nodes, and for those nodes to be in an aggregate that represents the group that will be the target of our pre-caching operation. In this example, I have a 100-node cloud with numbered nodes that look like this:

$ nova service-list --binary nova-compute
+--------------+--------------+
| Binary | Host |
+--------------+--------------+
| nova-compute | guaranine1 |
| nova-compute | guaranine2 |
| nova-compute | guaranine3 |
| nova-compute | guaranine4 |
| nova-compute | guaranine5 |
| nova-compute | guaranine6 |
| nova-compute | guaranine7 |
.... and so on ...
| nova-compute | guaranine100 |
+--------------+--------------+

In order to be able to request that an image be pre-cached on these nodes, I need to put some of them into an aggregate. I will do that programmatically since there are so many of them like this:

$ nova aggregate-create my-application
+----+-----------------+-------------------+-------+----------+--------------------------------------+
| Id | Name | Availability Zone | Hosts | Metadata | UUID |
+----+-----------------+-------------------+-------+----------+--------------------------------------+
| 2 | my-application | - | | | cf6aa111-cade-4477-a185-a5c869bc3954 |
+----+-----------------+-------------------+-------+----------+--------------------------------------+
$ for i in $(seq 1 95); do nova aggregate-add-host my-application guaranine$i; done
... lots of noise ...

Now that I have done that, I am able to request that an image be pre-cached on all the nodes within that aggregate by using the nova aggregate-cache-images command:

$ nova aggregate-cache-images my-application c3b84ecf-43e9-4c6c-adfd-ab6db0e2bca2
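If you would rather drive the API directly than use the nova CLI, a rough curl sketch of the equivalent request looks like the following, assuming the cache endpoint landed as POST /os-aggregates/{aggregate_id}/images at compute API microversion 2.81, and using placeholder endpoint and token variables:

curl -X POST "$COMPUTE_ENDPOINT/os-aggregates/<aggregate id>/images" \
  -H "X-Auth-Token: $TOKEN" \
  -H "OpenStack-API-Version: compute 2.81" \
  -H "Content-Type: application/json" \
  -d '{"cache": [{"id": "c3b84ecf-43e9-4c6c-adfd-ab6db0e2bca2"}]}'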

If all goes to plan, sometime in the future all of the hosts in that aggregate will have fetched the image into their local cache and will be able to use that for subsequent instance creation. Depending on your configuration, that happens largely sequentially to avoid storming Glance, and with so many hosts and a decently-sized image, it could take a while. If I am waiting to deploy my application until all the compute hosts have the image, I need some way of monitoring the process.

Monitoring progress

Many of the OpenStack services send notifications via the messaging bus (i.e. RabbitMQ) and Nova is no exception. That means that whenever things happen, Nova sends information about those things to a queue on that bus (if so configured) which you can use to receive asynchronous information about the system.

The image pre-cache operation sends start and end versioned notifications, as well as progress notifications for each host in the aggregate, which allows you to follow along. Ensure that you have set [notifications]/notification_format=versioned in your config file in order to receive these. A sample intermediate notification looks like this:

{
'index': 68,
'total': 95,
'images_failed': [],
'uuid': 'ccf82bd4-a15e-43c5-83ad-b23970338139',
'images_cached': ['c3b84ecf-43e9-4c6c-adfd-ab6db0e2bca2'],
'host': 'guaranine68',
'id': 1,
'name': 'my-application',
}

This tells us that host guaranine68 just completed its cache operation for one image in the my-application aggregate. It was host 68 of 95 total. Since the image ID we used is in the images_cached list, that means it was either successfully downloaded on that node, or was already present. If the image failed to download for some reason, it would be in the images_failed list.

In order to demonstrate what this might look like, I wrote some example code. This is not intended to be production-ready, but it will provide a template for you to write something of your own to connect to the bus and monitor a cache operation. You would run this before kicking off the process; it waits for a cache operation to begin, prints information about progress, and then exits with a non-zero status code if any errors were detected. For the above example invocation, the output looks like this:

$ python image_cache_watcher.py
Image cache started on 95 hosts
Aggregate 'foo' host 95: 100% complete (8 errors)
Completed 94 hosts, 8 errors in 2m31s
Errors from hosts:
guaranine2
guaranine3
guaranine4
guaranine5
guaranine6
guaranine7
guaranine8
guaranine9
Image c3b84ecf-43e9-4c6c-adfd-ab6db0e2bca2 failed 8 times

In this case, I intentionally configured eight hosts so that the image download would fail for demonstration purposes.

Future

The image caching functionality in Nova may gain more features in the future, but for now, it is a best-effort sort of thing. With just a little bit of scripting, Ussuri operators should be able to kick off and monitor image pre-cache operations and substantially improve time-to-boot performance for their users.

by Dan at November 04, 2019 07:30 PM

Mirantis

How to build an edge cloud part 1: Building a simple facial recognition system

Learn about the basics of building an edge cloud -- and build a facial recognition system while you're at it.

by Nick Chase at November 04, 2019 07:06 PM

OpenStack Superuser

Baidu wins Superuser Award at Open Infrastructure Summit Shanghai

The Baidu ABC Cloud Group and Edge Security teams are the 11th organization to win the Superuser Award. The news was announced today at the Open Infrastructure Summit in Shanghai. Baidu ABC Cloud Group and Edge Security Team integrated Kata Containers into the platform for all of Baidu internal and external cloud services including edge applications. Their cloud products, including both VMs and bare metal servers, cover 11 regions in China with over 5,000 physical machines. To date, 17 important online businesses have been migrated to the Kata Containers platform.

Elected by members of the OSF community, the team that wins the Superuser Award is lauded for the unique nature of its use case as well as its integration and application of open infrastructure. Four out of five nominees for the Superuser Award presented today were from the APAC region: Baidu ABC Cloud Group and Edge Security Team, InCloud OpenStack Team of Inspur, Information Management Department of Wuxi Metro, and Rakuten Mobile Network Organization. Previous award winners from the APAC region include China Mobile, NTT Group, and the Tencent TStack Team.

Baidu Keynote at Open Infrastructure Summit

On the keynote stage in Shanghai, Baidu Cloud Senior Architect Zhang Yu explained that Kata Containers provides a virtual machine-like security mechanism at the container level, which gives their customers a great deal of confidence and less concern when moving their business to a container environment. Kata Containers is compatible with the OCI standard, and users can directly manage the new environment with popular management suites such as Kubernetes. Kata Containers is now an official project under the OpenStack Foundation, which gives the company confidence to invest in the project.

“Baidu is an amazing example of how open infrastructure starts with OpenStack,” said Mark Collier, COO of the OpenStack Foundation. “They’re running OpenStack at massive scale, combined with other open infrastructure technologies like Kata Containers and Kubernetes, and they’re doing it in production for business-critical workloads.”

*** Download the Baidu Kata Containers White Paper ***

The company has published a white paper titled, “The Application of Kata Containers in Baidu AI Cloud” available here.

The post Baidu wins Superuser Award at Open Infrastructure Summit Shanghai appeared first on Superuser.

by Allison Price at November 04, 2019 04:39 AM

November 02, 2019

StackHPC Team Blog

StackHPC joins the OpenStack Marketplace

In many areas, our participation in the OpenStack community is no secret.

One area we haven't focussed on is our commercial representation within the OpenStack Foundation. As described here, StackHPC works with clients to solve challenging problems with cloud infrastructure. Our business has been won through word of mouth.

Now our services can also be found in the OpenStack Marketplace.

John Taylor, StackHPC's co-founder and CEO, adds:

We are pleased to announce our OpenStack Foundation membership and inclusion in the OpenStack Marketplace. Our success in driving the HPC and Research Computing use-case in cloud has been in no small part coupled to working closely with the OpenStack Foundation and the open community it fosters. The era of hybrid cloud and the emergence of converged AI/HPC infrastructure and coupled workflows is now upon us, driving the need for architectures that seamlessly transition across these resources while not compromising on performance. We look forward to continuing our partnership with OpenStack through the Scientific SIG and to active participation within OpenStack projects.

Get in touch

If you would like to get in touch we would love to hear from you. Reach out to us via Twitter or directly via our contact page.

by Stig Telfer at November 02, 2019 09:00 AM

October 31, 2019

RDO

RDO Train Released

The RDO community is pleased to announce the general availability of the RDO build for OpenStack Train for RPM-based distributions, CentOS Linux and Red Hat Enterprise Linux. RDO is suitable for building private, public, and hybrid clouds. Train is the 20th release from the OpenStack project, which is the work of more than 1115 contributors from around the world.

The release is already available on the CentOS mirror network at http://mirror.centos.org/centos/7/cloud/x86_64/openstack-train/. While we normally also have the release available via http://mirror.centos.org/altarch/7/cloud/ppc64le/ and http://mirror.centos.org/altarch/7/cloud/aarch64/ – there have been issues with the mirror network which is currently being addressed via https://bugs.centos.org/view.php?id=16590.

The RDO community project curates, packages, builds, tests and maintains a complete OpenStack component set for RHEL and CentOS Linux and is a member of the CentOS Cloud Infrastructure SIG. The Cloud Infrastructure SIG focuses on delivering a great user experience for CentOS Linux users looking to build and maintain their own on-premise, public or hybrid clouds.

All work on RDO and on the downstream release, Red Hat OpenStack Platform, is 100% open source, with all code changes going upstream first.

PLEASE NOTE: At this time, RDO Train provides packages for CentOS7 only. We plan to move RDO to use CentOS8 as soon as possible during Ussuri development cycle so Train will be the last release working on CentOS7.

Interesting things in the Train release include:

  • OpenStack Ansible, which provides Ansible playbooks and roles for deployment, added Murano support and fully migrated to systemd-journald from rsyslog. This project makes it possible to deploy OpenStack from source in a way that is scalable while also being simple to operate, upgrade, and grow.
  • Ironic, the Bare Metal service, aims to produce an OpenStack service and associated libraries capable of managing and provisioning physical machines in a security-aware and fault-tolerant manner. Beyond providing basic support for building software RAID and a myriad of other highlights, this project now offers a new tool for building ramdisk images, ironic-python-agent-builder.

Other improvements include:

  • Tobiko is now available within RDO! This project is an OpenStack testing framework focusing on areas mostly complementary to Tempest. While Tempest’s main focus has been testing OpenStack REST APIs, Tobiko’s main focus is to test OpenStack system operations while “simulating” the use of the cloud as a final user would. Tobiko’s test cases populate the cloud with workloads such as instances, allow the CI workflow to perform an operation such as an update or upgrade, and then run test cases to validate that the cloud workloads are still functional.
  • Other highlights of the broader upstream OpenStack project may be read via https://releases.openstack.org/train/highlights.html.

Contributors
During the Train cycle, we saw the following new RDO contributors:

  • Joel Capitao
  • Zoltan Caplovic
  • Sorin Sbarnea
  • Sławek Kapłoński
  • Damien Ciabrini
  • Beagles
  • Soniya Vyas
  • Kevin Carter (cloudnull)
  • fpantano
  • Michał Dulko
  • Stephen Finucane
  • Sofer Athlan-Guyot
  • Gauvain Pocentek
  • John Fulton
  • Pete Zaitcev

Welcome to all of you and Thank You So Much for participating!

But we wouldn’t want to overlook anyone. A super massive Thank You to all 65 contributors who participated in producing this release. This list includes commits to rdo-packages and rdo-infra repositories:

  • Adam Kimball
  • Alan Bishop
  • Alex Schultz
  • Alfredo Moralejo
  • Arx Cruz
  • Beagles
  • Bernard Cafarelli
  • Bogdan Dobrelya
  • Brian Rosmaita
  • Carlos Goncalves
  • Cédric Jeanneret
  • Chandan Kumar
  • Damien Ciabrini
  • Daniel Alvarez
  • David Moreau Simard
  • Dmitry Tantsur
  • Emilien Macchi
  • Eric Harney
  • fpantano
  • Gael Chamoulaud
  • Gauvain Pocentek
  • Jakub Libosvar
  • James Slagle
  • Javier Peña
  • Joel Capitao
  • John Fulton
  • Jon Schlueter
  • Kashyap Chamarthy
  • Kevin Carter (cloudnull)
  • Lee Yarwood
  • Lon Hohberger
  • Luigi Toscano
  • Luka Peschke
  • marios
  • Martin Kopec
  • Martin Mágr
  • Matthias Runge
  • Michael Turek
  • Michał Dulko
  • Michele Baldessari
  • Natal Ngétal
  • Nicolas Hicher
  • Nir Magnezi
  • Otherwiseguy
  • Gabriele Cerami
  • Pete Zaitcev
  • Quique Llorente
  • Radomiropieralski
  • Rafael Folco
  • Rlandy
  • Sagi Shnaidman
  • shrjoshi
  • Sławek Kapłoński
  • Sofer Athlan-Guyot
  • Soniya Vyas
  • Sorin Sbarnea
  • Stephen Finucane
  • Steve Baker
  • Steve Linabery
  • Tobias Urdin
  • Tony Breeds
  • Tristan de Cacqueray
  • Victoria Martinez de la Cruz
  • Wes Hayutin
  • Yatin Karel
  • Zoltan Caplovic

The Next Release Cycle
At the end of one release, focus shifts immediately to the next, Ussuri, which has an estimated GA the week of 11-15 May 2020. The full schedule is available at https://releases.openstack.org/ussuri/schedule.html.

Twice during each release cycle, RDO hosts official Test Days shortly after the first and third milestones; therefore, the upcoming test days are 19-20 December 2019 for Milestone One and 16-17 April 2020 for Milestone Three.

Get Started
There are three ways to get started with RDO.

To spin up a proof of concept cloud, quickly, and on limited hardware, try an All-In-One Packstack installation. You can run RDO on a single node to get a feel for how it works.
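For reference, a minimal sketch of what that can look like on a fresh CentOS 7 machine is below; treat it as an outline and follow the RDO quickstart documentation for the authoritative steps:

sudo yum install -y centos-release-openstack-train
sudo yum update -y
sudo yum install -y openstack-packstack
sudo packstack --allinone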

For a production deployment of RDO, use the TripleO Quickstart and you’ll be running a production cloud in short order.

Finally, for those that don’t have any hardware or physical resources, there’s the OpenStack Global Passport Program. This is a collaborative effort between OpenStack public cloud providers to let you experience the freedom, performance and interoperability of open source infrastructure. You can quickly and easily gain access to OpenStack infrastructure via trial programs from participating OpenStack public cloud providers around the world.

Get Help
The RDO Project participates in a Q&A service at https://ask.openstack.org. We also have our users@lists.rdoproject.org mailing list for RDO-specific users and operators. For more developer-oriented content we recommend joining the dev@lists.rdoproject.org mailing list. Remember to post a brief introduction about yourself and your RDO story. The mailing list archives are all available at https://mail.rdoproject.org. You can also find extensive documentation on RDOproject.org.

The #rdo channel on Freenode IRC is also an excellent place to find and give help.

We also welcome comments and requests on the CentOS devel mailing list and the CentOS and TripleO IRC channels (#centos, #centos-devel, and #tripleo on irc.freenode.net), however we have a more focused audience within the RDO venues.

Get Involved
To get involved in the OpenStack RPM packaging effort, check out the RDO contribute pages, peruse the CentOS Cloud SIG page, and inhale the RDO packaging documentation.

Join us in #rdo and #tripleo on the Freenode IRC network and follow us on Twitter @RDOCommunity. You can also find us on Facebook and YouTube.

by Rain Leander at October 31, 2019 04:18 PM

October 29, 2019

Galera Cluster by Codership

Galera Cluster for MySQL 5.6.46 and MySQL 5.7.28 is GA

Codership is pleased to announce a new Generally Available (GA) release of Galera Cluster for MySQL 5.6 and 5.7, consisting of MySQL-wsrep 5.6.46 (release notes, download) and MySQL-wsrep 5.7.28 (release notes, download). There is no Galera replication library release this time, so please continue using the 3.28 version, implementing wsrep API version 25.

This release incorporates all changes to MySQL 5.6.46 and 5.7.28 respectively and can be considered an updated rebased version. It is worth noting that we will have some platforms reach end of life (EOL) status, notably OpenSUSE 13.2 and Ubuntu Trusty 14.04.

You can get the latest release of Galera Cluster from https://www.galeracluster.com. There are package repositories for Debian, Ubuntu, CentOS, RHEL, OpenSUSE and SLES. The latest versions are also available via the FreeBSD Ports Collection.

by Colin Charles at October 29, 2019 12:42 PM

October 28, 2019

Mirantis

53 Things to look for in OpenStack Train

Now that OpenStack Train has been released, here are some features to look for.

by Nick Chase at October 28, 2019 04:27 PM

October 24, 2019

OpenStack Superuser

Using GitHub and Gerrit with Zuul: A leboncoin case study

Described as an online flea market, leboncoin is a portal that allows individuals to buy and sell new and used goods online in their local communities.  Leboncoin is one of the top ten searched websites in France, following Google, Facebook, and YouTube to name a few.

We got talking with Guillaume Chenuet to get some answers to why Leboncoin chose Zuul and how they use it with GitHub, Gerrit, and OpenStack.  

How did your organization get started with Zuul?

We started using Zuul, an open source CI tool, two years ago with Zuulv2 and Jenkins. At the beginning, we only used Gerrit and Jenkins, but as new developers joined leboncoin every day, this solution was not enough. After some research and a proof-of-concept, we gave Zuul a try, running between Gerrit and Jenkins. In less than a month (and without much official documentation) we had set up a completely new stack. We ran it for a year before moving to Zuulv3. Zuulv3 is more complex in terms of setup but brings us more features using up-to-date tools like Ansible or OpenStack.

Describe how you’re using it:

We’re using Zuulv3 with Gerrit. Our workflow is close to the OpenStack one. For each review, Zuul is triggered on three “check” pipelines: quality, integration and build. Once the results are correct, we use the gate system to merge the code into repositories and build artifacts.
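In Zuul terms, a project definition along these lines captures that workflow; this is only a simplified sketch with illustrative job names, not our actual configuration:

- project:
    quality:
      jobs:
        - run-linters
    integration:
      jobs:
        - run-integration-tests
    build:
      jobs:
        - build-artifacts
    gate:
      jobs:
        - run-integration-tests
        - build-artifacts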

We are using two small OpenStack clusters (3 CTRL / 3 STRG / 5 COMPUTE) in each datacenter. Zuul is currently set up on all Gerrit projects and some GitHub projects too. Below is our Zuulv3 infrastructure in production and in the case of datacenter loss.

 

Zuulv3 infrastructure in production.

 

Zuulv3 infrastructure in the case of DC loss.

What is your current scale?

In terms of compute resources, we currently have 480 cores, 1.3TB of RAM and 80TB available in our Ceph clusters. In terms of jobs, we run around 60,000 jobs per month, which means around 2,500 jobs per day. Average job time is less than five minutes.

 

What benefits has your organization seen from using Zuul?

As leboncoin is growing very fast (and microservices too 🙂 ), Zuul allows us to ensure everything can be tested, and at scale. Zuul is also able to work with both Gerrit and GitHub, which lets us open our CI to more teams and workflows.

What have the challenges been (and how have you solved them)?

Our big challenge was migrating from Zuulv2 to Zuulv3. Even though everything uses Ansible, it was very tiresome to migrate all our CI jobs (around 500 Jenkins jobs). With the help of the Zuul folks on IRC, we used some Ansible roles and playbooks used by OpenStack, but the migration took about a year.

What are your future plans with Zuul?

Our next steps are to use Kubernetes backend for small jobs like linters and improve Zuul with GitHub.

How can organizations who are interested in Zuul learn more and get involved?

Coming from OpenStack, I think meeting the community at Summits or on IRC is a good start. But Zuul needs better visibility. It is a powerful tool but the information online is limited.

Are there specific features that drew you to Zuul?

Scalability! And also ensuring that every commit merged into the repository is clean and can’t be broken.

What would you request from the Zuul upstream community?

Work on better integration with Gerrit 3, new Nodepool features and providers, full HA, and more visibility on the Internet.

 

Cover image courtesy of Guillaume Chenuet.

The post Using GitHub and Gerrit with Zuul: A leboncoin case study appeared first on Superuser.

by Ashleigh Gregory at October 24, 2019 02:00 PM

October 23, 2019

Corey Bryant

OpenStack Train for Ubuntu 18.04 LTS

The Ubuntu OpenStack team at Canonical is pleased to announce the general availability of OpenStack Train on Ubuntu 18.04 LTS via the Ubuntu Cloud Archive. Details of the Train release can be found at:  https://www.openstack.org/software/train

To get access to the Ubuntu Train packages:

Ubuntu 18.04 LTS

You can enable the Ubuntu Cloud Archive pocket for OpenStack Train on Ubuntu 18.04 installations by running the following commands:

    sudo add-apt-repository cloud-archive:train
    sudo apt update

The Ubuntu Cloud Archive for Train includes updates for:

aodh, barbican, ceilometer, ceph (14.2.2), cinder, designate, designate-dashboard, dpdk (18.11.2), glance, gnocchi, heat, heat-dashboard, horizon, ironic, keystone, libvirt (5.4.0), magnum, manila, manila-ui, mistral, murano, murano-dashboard, networking-arista, networking-bagpipe, networking-bgpvpn, networking-hyperv, networking-l2gw, networking-mlnx, networking-odl, networking-ovn, networking-sfc, neutron, neutron-dynamic-routing, neutron-fwaas, neutron-lbaas, neutron-lbaas-dashboard, neutron-vpnaas, nova, octavia, openstack-trove, openvswitch (2.12.0), panko, placement, qemu (4.0), sahara, sahara-dashboard, senlin, swift, trove-dashboard, vmware-nsx, watcher, and zaqar.

For a full list of packages and versions, please refer to:

http://reqorts.qa.ubuntu.com/reports/ubuntu-server/cloud-archive/train_versions.html

Python support

The Train release of Ubuntu OpenStack is Python 3 only; all Python 2 packages have been dropped in Train.

Branch package builds

If you would like to try out the latest updates to branches, we deliver continuously integrated packages on each upstream commit via the following PPA’s:

    sudo add-apt-repository ppa:openstack-ubuntu-testing/mitaka
    sudo add-apt-repository ppa:openstack-ubuntu-testing/ocata
    sudo add-apt-repository ppa:openstack-ubuntu-testing/queens
    sudo add-apt-repository ppa:openstack-ubuntu-testing/rocky
    sudo add-apt-repository ppa:openstack-ubuntu-testing/train

Reporting bugs

If you have any issues please report bugs using the ‘ubuntu-bug’ tool to ensure that bugs get logged in the right place in Launchpad:

    sudo ubuntu-bug nova-conductor

Thanks to everyone who has contributed to OpenStack Train, both upstream and downstream. Special thanks to the Puppet OpenStack modules team and the OpenStack Charms team for their continued early testing of the Ubuntu Cloud Archive, as well as the Ubuntu and Debian OpenStack teams for all of their contributions.

Enjoy and see you in Ussuri!

Corey
(on behalf of the Ubuntu OpenStack team)

by coreycb at October 23, 2019 12:56 PM

October 22, 2019

RDO

Cycle Trailing Projects and RDO’s Latest Release Train

The RDO community is pleased to announce the general availability of the RDO build for OpenStack Train for RPM-based distributions, CentOS Linux and Red Hat Enterprise Linux. RDO is suitable for building private, public, and hybrid clouds. Train is the 20th release from the OpenStack project, which is the work of more than 1115 contributors from around the world.

The release is already available on the CentOS mirror network at http://mirror.centos.org/centos/7/cloud/x86_64/openstack-train/.

BUT!

This is not the official announcement you’re looking for.

We’re doing something a little different this cycle – we’re waiting for some of the “cycle-trailing” projects that we’re particularly keen about, like TripleO and Kolla, to finish their push BEFORE we make the official announcement.

Photo by Denis Chick on Unsplash

Deployment and lifecycle-management tools generally want to follow the release cycle, but because they rely on the other projects being completed, they may not always publish their final release at the same time as those projects. To that effect, they may choose the cycle-trailing release model.

Cycle-trailing projects are given an extra three months after the final release date to request publication of their release. They may otherwise use intermediary releases or development milestones.

While we’re super hopeful that these cycle trailing projects will be uploaded to the CentOS mirror before the Open Infrastructure Summit Shanghai, we’re going to do the official announcement just before the Summit with or without the packages.

We’ve got a lot of people to thank!

Do you like that we’re waiting a bit for our cycle trailing projects or would you prefer the official announcement as soon as the main projects are available? Let us know in the comments and we may adjust the process for future releases!

In the meantime, keep an eye here or on the mailing lists for the official announcement COMING SOON!

by Rain Leander at October 22, 2019 02:34 PM

Sean McGinnis

October 2019 OpenStack Board Notes

Another OpenStack Foundation Board of Directors meeting was held October 22, 2019. This meeting was added primarily to discuss Airship’s request for confirmation to become an official project.

The meeting agenda is published on the wiki.

OSF Updates

Jonathan Bryce gave a quick update on the OpenStack Train release that went out last week. The number of contributors, variety of companies, and overall commit numbers were pretty impressive. There were over 25,500 merged commits in Train, with 1,125 unique developers from 165 different organizations. With commits over the last cycle, OpenStack is still one of the top three active open source projects out there, after the Linux kernel and Chromium.

Jonathan also reiterated that the event structure will be different starting in 2020. The first major event planned is in Vancouver, June 8. This will be more of a collaborative event, so expect the format to be different than past Summits. I’m thinking more Project Teams Gathering than Summit.

Airship Confirmation

Matt McEuen, Alex Hughes, Kaspar Skels, and Jay Ahn went through the Airship Confirmation presentation and answered questions about the project and their roadmap. Overall, really pretty impressive what the Airship community has been able to accomplish so far.

The Airship mission statement is:

Openly collaborate across a diverse, global community to provide and integrate a collection of loosely coupled but interoperable, open source tools that declaratively automate cloud lifecycle management.

They started work inside of openstack-helm and kept to the OpenStack community Four Opens right from the start.

Project Diversity

The project was started by AT&T, so there is still a lot of work being done (code reviews, commits, etc.) by that one company, but the trend over the last couple of years has been really good, trending towards more and more contributor diversity.

They also have good policies in place to make sure the Technical Committee and Working Committee have no more than two members from the same company. Great to see this policy in place to really encourage more diversity in the spots where overall project decisions are made. Kudos to the AT&T folks for not only getting things started, but driving a lot of change while still actively encouraging others so it is not a one company show. It can be hard for some companies to realize that giving up absolute control is a good thing, especially when it comes to an open source community.

Community Feedback

Part of the confirmation process is to make sure the existing OSF projects are comfortable with the new project. There was feedback from the Zuul project and from the OpenStack TC. Rico Lin went through the TC feedback in the meeting. Only minor questions or concerns were raised there, and Matt was able to respond to most of them in the meeting. He did state he would respond to the mailing list so there was a record there of the responses.

Licensing

Really the only point of concern was raised at the end. One difference between Airship and other OpenStack projects is that it is written in Go. Go has a great built-in module system that makes it easy to use code written by others. But that led to the question of licensing.

The OSF bylaws state:

The Board of Directors may approve a license for an Open Infrastructure Project other than Apache License 2.0, but such license must be a license approved by the Open Source Initiative at the date of adoption of such license.

The Airship code itself is Apache 2.0. But there isn’t anything done today to vet the dependencies that are pulled in to actually compile the project. The concern is that copyleft licenses usually have provisions stating that if copyleft code is pulled in and linked to non-copyleft code, the combined work falls under the copyleft requirements. So the only real concern was that it just isn’t known what the effective license of the project is today, based on what is being pulled in.

It can be a very tricky area that definitely requires involvement of lawyers who understand copyright law and open source licensing. Luckily it wasn’t a showstopper. We moved to add the project and have them work with OSF legal to better understand the licensing impacts and to resolve any concerns by swapping out dependencies if any are found to carry a license that would impose copyleft terms on Airship. The board unanimously voted in favor of Airship becoming a fully official Open Infrastructure Project.
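
As a purely illustrative aside (not something the board or the Airship team committed to), vetting the effective license of a Go project’s dependency tree can be done with tooling such as Google’s go-licenses. A minimal sketch, assuming the tool is installed and run from the module root, might look like this:

# List every dependency the module pulls in along with the license it was
# detected to carry.
go-licenses csv ./...

# Fail if any dependency carries a disallowed (e.g. restricted/copyleft) license.
go-licenses check ./...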

Next Meeting

The next OSF board meeting will take place November 3rd, in Shanghai, the day before the Open Infrastructure Summit.

by Sean McGinnis at October 22, 2019 12:00 AM

October 21, 2019

OpenStack Superuser

Collaboration across Boundaries: Open Culture Drives Evolving Tech

This past summer marked a pinnacle in OpenStack’s history — the project’s ninth birthday — a project that epitomizes collaboration without boundaries. Communities composed of diverse individuals and companies united around the globe to celebrate this exciting milestone, from Silicon Valley to the Pacific Northwest to the Far East. Participants from communities that spanned OpenStack, Kubernetes, Kata Containers, Akraino, Project ACRN and Clear Linux — and represented nearly 60 organizations — shared stories about their collective journey, and looked towards the future.

An Amazing Journey

The Shanghai event brought together several organizations, including 99Cloud, China Mobile, Intel, ZStack, East China Normal University, Shanghai Jiaotong University and Tongji University, as well as the OpenStack Foundation.

Individual OpenStack Foundation board director, Shane Wang, talked about OpenStack’s history. What began as an endeavor to bring greater choice in cloud solutions to users, combining Nova for compute from NASA with Swift for object storage from Rackspace, has since grown into a strong foundation for open infrastructure. The project is supported by one of the largest, global open source communities of 105,000 members in 187 countries from 675 organizations, backed by over 100 member companies.

“After nine years of development, OpenStack and current Open Infrastructure have attracted a large number of enterprises, users, developers and enthusiasts to join the community, and together we’ve initiated new projects and technologies to address emerging scenarios and use cases,” said Shane. “Through OpenStack and Open Infrastructure, businesses can realize healthy profits, users can satisfy their needs, innovations can be incubated through a thriving community, and individuals can grow their skills and talents. These are the reasons that the community stays strong and popular.”

Truly representative of cross-project collaboration, this Open Infrastructure umbrella now encompasses components that can be used to address existing and emerging use cases across data center and edge. Today’s applications span enterprise, artificial intelligence, machine learning, 5G networking and more. Adoption ranges from retail, financial services, academia and telecom to manufacturing, public cloud and transportation.

 

 

Junwei Liu, OpenStack board member from China Mobile, winner of the 2016 SuperUser award, joined the birthday celebration. He reflected on OpenStack’s capability to address existing and emerging business needs: “Since 2015, China Mobile, a leading company in the cloud industry, has built a public cloud, private cloud and networking cloud for internal and external customers based on OpenStack. OpenStack has been proven mature enough to meet the needs of core business and has become the de facto standard of IaaS resource management. The orchestration systems integrating Kubernetes into OpenStack as the core will be the most controllable and the most suitable cloud computing platform which meets enterprises’ own business needs.”

Ruoyu Ying, Cloud Software Engineer at Intel, reflected on the various release names, and summits, over the years. There have been several exciting milestones along the way: the inaugural release and summit, both bearing the name Austin to commemorate OpenStack’s birthplace in Austin, Texas; the fourth release, Diablo, which established the twice-yearly release cadence and expanded the summit outside of Texas to Santa Clara, California; the ninth release, Icehouse, which heralded a move of the summit outside North America to Hong Kong and invited more developers from Asia to contribute; the eleventh release, Kilo, which expanded the summit into Europe, specifically Paris, France; the 17th release, Queens, which saw the summit move into the southern hemisphere to Sydney, Australia; and ultimately, the 20th release, Train, with the change in summit name to OpenInfra to accurately reflect the evolution in the project and community.


In November, the summit will be held in mainland China for the first time, and the team there is looking forward to welcoming the global community with open arms!

Collaboration across Boundaries

Meetups across Silicon Valley and the Pacific Northwest, which were sponsored by Intel, Portworx, Rancher Labs and Red Hat, personified collaboration across projects and communities. Individuals from the OpenStack, Kubernetes, Kata Containers, Akraino, Clear Linux and Project ACRN communities — representing over 50 organizations — came together to celebrate this special milestone with commemorative birthday cupcakes and a strong lineup of presentations focused on emerging technologies and use cases.

Containers and container orchestration technologies were highlights, as Jonathan Gershater, Senior Principal Product Marketing Manager at Red Hat, talked about how to deploy, orchestrate and manage enterprise Kubernetes on OpenStack, while Gunjan Patel, Cloud Architect at Palo Alto Networks, talked about the full lifecycle of a Kubernetes pod. Rajashree Mandaogane, Software Engineer at Rancher Labs, and Oksana Chuiko, Software Engineer at Portworx, delivered lightning talks focused on Kubernetes. Eric Ernst, Kata Containers’ Technical Lead and Architecture Committee Member, and Senior Software Engineer at Intel, talked about running container solutions with the extra isolation provided by Kata Containers, while Manohar Castelino and Ganesh Mahalingam, Software Engineers at Intel, gave demos of many of Kata Containers’ newest features.

Edge computing and IoT were also hot topics. Zhaorong Hou, Virtualization Software Manager at Intel, talked about how Project ACRN addresses the need for lightweight hypervisors in booming IoT development, while Srinivasa Addepalli, Senior Principal Software Architect at Intel, dove into one of the blueprints set forth by the Akraino project—the Integrated Cloud Native Stack—and how it addresses edge deployments for both network functions and application containers.

Beatriz Palmeiro, Community and Developer Advocate at Intel, engaged attendees in a discussion about how to collaborate and contribute to the Clear Linux project, while Kateryna Ivashchenko, Marketing Specialist at Portworx, provided us all with an important reminder about how not to burn out in tech.

Open Culture Drives Evolving Tech

There is incredible strength in the OpenStack community. As noted at the Shanghai event, OpenStack powers open infrastructure across data centers and edge, enabling private and hybrid cloud models to flourish. This strength is due, in part, to the amazing diversity within the OpenStack community.

Throughout its history, OpenStack has been committed to creating an open culture that invites diverse contributions. This truth is evident in many forms: diversity research, representation of women on the keynote stage and as speakers across the summits, speed mentoring workshops, diversity luncheons and more. The breadth of allies and advocates for underrepresented minorities abounds in our community, from Joseph Sandoval, who keynoted at the Berlin summit to talk about the importance of projects like OpenStack in enabling diversity, to Tim Berners-Lee, who participated in the speed mentoring workshop in Berlin, to Lisa-Marie Namphy, who organized and hosted the event in the Silicon Valley and made sure that over 50% of her presenters were women, among many others.

“OpenStack is a strategic platform that I believe will enable diversity.” — Joseph Sandoval, OpenStack User Committee Member and SRE Manager, Infrastructure Platform, Adobe

As OpenStack evolves as the foundation for the open infrastructure, and new projects and technologies emerge to tackle the challenges of IoT, edge and other exciting use cases, diversity — in gender, race, perspective, experience, expertise, skill set, and more — becomes increasingly important to the health of our communities. From developers and coders to community and program managers, ambassadors, event and meetup organizers, and more, it truly takes a village to sustain a community and ensure the health of a project!

Early OpenStack contributor, community architect, and OpenStack Ambassador Lisa-Marie Namphy reflected on OpenStack’s evolution and what she’s most excited about looking forward. As organizer of the original San Francisco Bay Area User Group, which has now expanded beyond just OpenStack to reflect the broader ecosystem of Cloud Native Containers, she has established one of the largest OpenStack & Cloud Native user groups in the world. “Our user group has always committed to showcasing the latest trends in cloud native computing, whether that was OpenStack, microservices, serverless, open networking, or our most exciting recent trend: containers! In response to our passionate and vocal community members, we’ve added more programming around Kubernetes, Istio, Kata Containers and other projects representing the diversity of the open infrastructure ecosystem. It’s as exciting as ever to be a part of this growing open cloud community!” Lisa now works as Director of Marketing at Portworx, contributing to the OpenStack, Kubernetes, and Istio communities.

Looking Forward

As we blow out the birthday candles, we’d like to thank the organizers, sponsors, contributors and participants of these meetups — with a special thank you to Kari Fredheim, Liz Aberg, Liz Warner, Sujata Tibrewala, Lisa-Marie Namphy, Maggie Liang, Shane Wang and Ruoyu Ying.

As we look forward, the OpenStack Foundation has just revealed the name of the project’s next release — Ussuri, a river in China — commemorative of the summit’s next location in Shanghai. “The river teems with different kinds of fish: grayling, sturgeon, humpback salmon (gorbusha), chum salmon (keta), and others.”1 A fitting name to embody diverse projects, communities and technologies working in unison to further innovation!

***

1 Source: https://en.wikipedia.org/wiki/Ussuri_River

 

 

The post Collaboration across Boundaries: Open Culture Drives Evolving Tech appeared first on Superuser.

by Nicole Huesman at October 21, 2019 02:00 PM

RDO

Community Blog Round Up 21 October 2019

Just in time for Halloween, Andrew Beekhof has a ghost story about the texture of hounds.

But first!

Where have all the blog round ups gone?!?

Well, there’s the rub, right?

We don’t usually post when there’s only one post (or none) from our community to round up, but this has been the only post for WEEKS now, so here it is.

Thanks, Andrew!

But that brings us to another point.

We want to hear from YOU!

RDO has a database of bloggers who write about OpenStack / RDO / TripleO / Packstack things and while we’re encouraging those people to write, we’re also wondering if we’re missing some people. Do you know of a writer who is not included in our database? Let us know in the comments below.

Photo by Jessica Furtney on Unsplash

Savaged by Softdog, a Cautionary Tale by Andrew Beekhof

Hardware is imperfect, and software contains bugs. When node level failures occur, the work required from the cluster does not decrease – affected workloads need to be restarted, putting additional stress on surviving peers and making it important to recover the lost capacity.

Read more at http://blog.clusterlabs.org/blog/2019/savaged-by-softdog

by Rain Leander at October 21, 2019 09:17 AM

October 18, 2019

OpenStack Superuser

OpenStack Ops Meetup Features Ceph, OpenStack Architectures and Operator Pain Points

Bloomberg recently hosted an OpenStack Ops Meetup in one of its New York engineering offices on September 3 and 4. The event was well attended with between 40 and 50 attendees, primarily from North America, with a few people even traveling from Japan!

The OpenStack Ops Meetups team was represented by Chris Morgan (Bloomberg), Erik McCormick (Cirrus Seven) and Shintaro Mizuno (NTT). In addition to this core group, other volunteer moderators who led sessions included Matthew Leonard (Bloomberg), Martin Gehrke (Two Sigma), David Medberry (Red Hat), Elaine Wong-Perry (Verizon), Assaf Muller (Red Hat), David Desrosiers (Canonical), and Conrad Bennett (Verizon), with many others contributing. The official meetups team is rather small, so volunteer moderators make such events come alive, and we couldn’t make them happen without all of you. Thanks to everyone who helped.

An interesting topic that Bloomberg brought up at this meetup was the concept of expanding the Ceph content. Ceph is a very popular storage choice in production-quality OpenStack deployments, as shown by the OpenStack user survey and by the fact that Ceph sessions at previous meetups have always been very popular. Bloomberg’s Matthew Leonard suggested to those attending the first Ceph session that we build upon this with more Ceph sessions, and perhaps even launch a separate Ceph operators meetup series in the future. Some of this discussion was captured here. Matthew also led a group discussion around a deeper technical dive into challenging use cases for Ceph, such as gigantic (multi-petabyte) object stores using Ceph’s RadosGW interface. It’s a relief that we are not the only ones hitting certain technical issues at this scale.

Response from the Ceph users at the meetup was positive and we will seek to expand Ceph content at the next event.

Other evergreen topics for OpenStack operators include deployment/upgrades, upgrades/long-term support, monitoring, testing and billing. These all saw some spirited debate and exchange of experience. The meetups team also shared some things that the ops community can point to as positive changes we have achieved, such as the policy changes allowing longer retention of older OpenStack documentation and maintenance branches.

To make the event a bit more fun, the meetups team always includes lightning talks at the end of each day. Day 1 saw an “arch show and tell” where those who were willing grabbed a microphone and talked about the specific architecture of their cloud. The variety of OpenStack architectures, use cases, and market segments is astonishing.

On day 2, many of the most noteworthy sessions were again moderated by volunteers. Assaf Muller from Red Hat led an OpenStack networking state-of-the-union discussion, with a certain amount of RDO focus, although not exclusively. Later on, Martin Gehrke from Two Sigma ran a double session: one on choosing appropriate workloads for your cloud, and one on reducing OpenStack toil.

As a slight change of pace, David Desrosiers demonstrated a lightning fast developer build of OpenStack using Canonical’s nifty “microstack” snap install of an all-in-one OpenStack instance, although our guest wifi picked this exact moment to pitch a fit – sorry David!

The final technical session of the event was another lightning talk, this time asking the guests to recount their best “ops war stories”. The organizers strongly encouraged everyone to participate, and later on revealed why – we arranged for a lighthearted scoring system and eventually awarded a winner (chosen by the attendees). There were even some nominal prizes! David Medberry moderated this session and it was a fun way to finish off the event.

The overall winner was Julia Kreger from Red Hat, who shared with us a story about “it must be a volume knob?” – it seems letting visitors near the power controls in the data center isn’t a great idea? Well, let’s just say it’s probably best if you try and hear Julia tell it in person!

The above gives just a brief flavor of the event; apologies to the sessions and moderators I didn’t mention. The next OpenStack Ops Meetup is expected to be somewhere in Europe in the first quarter of 2020.

Cover Photo courtesy of David Medberry

The post OpenStack Ops Meetup Features Ceph, OpenStack Architectures and Operator Pain Points appeared first on Superuser.

by Chris Morgan at October 18, 2019 02:00 PM

October 17, 2019

Mirantis

Tips for taking the new OpenStack COA (Certified OpenStack Administrator) exam – October 2019

Mirantis will be providing resources to the OpenStack Foundation, including becoming the new administrators of the upgraded Certified OpenStack Administrator (COA) exam.

by Nick Chase at October 17, 2019 01:36 AM

Sean McGinnis

September 2019 OpenStack Board Notes

There was another OpenStack Foundation Board of Directors conference call on September 10, 2019. There were a couple of significant updates during this call. Well, at least one significant for the community, and one that was significant to me (more details below).

In case this is your first time reading my BoD updates, just a reminder that upcoming and past OSF board meeting information is published on the wiki and the meetings are open to everyone. Occasionally there is a need to have a private, board member only portion of the call to go over any legal affairs that can’t be discussed publicly, but that should be a rare occasion.

September 10, 2019 OpenStack Foundation Board Meeting

The original agenda can be found here. Usually there are official and unofficial notes sent out, but at least at this time, it doesn’t appear Jonathan has been able to get to that. Watch for that to show up on the wiki page referenced in the previous section.

Director Changes

There were a couple of changes in the assigned Platinum Director seats. The Platinum-level sponsors hold the only seats on the board that are guaranteed to the sponsor and allow them to assign a Director. So there was no change in sponsorships at this point, just a couple of internal personnel changes that led to new representatives.

With all the churn and resulting separation of Futurewei in the US from the rest of Huawei, their board seat was moved over to Fred Li. I worked with Fred quite a bit during my time with the company. He’s a great guy and has put in a lot of work, mostly behind the scenes, to support OpenStack. Really happy to be able to work with him again. Anni has also done a lot over the years, so it’s sad to see her go. I’m sure she will be quite busy on new things though.

On the Red Hat side, Mark McLoughlin has transitioned out, handing things over to Daniel Becker. It sounds like with the internal structure at Red Hat, Daniel is now the better representative for the OpenStack Foundation. I personally didn’t get a lot of opportunity to work with Mark, but I know he has been around for a long time and has done a lot of great things, so I’m a little sad to see him go. But also looking forward to working with Daniel.

Director Diversity Waiver

This was the significant topic to me, because, well… it was about me.

In June I switched employers, going back to Dell EMC. So far, I’ve been very happy, and it feels like I’ve gone back home, given the 14+ years I spent between Compellent and Dell prior to joining Huawei. Not that my time with Huawei wasn’t great. I think I learned a lot and had some opportunities to do things that I hadn’t done before, so no regrets.

But the catch with my going back to Dell was that they already have a Gold sponsor seat with Arkady Kanevsky and a community spot with Prakash Ramchandran.

The OpenStack Foundation Bylaws have a section (4.17) on Director Diversity. This clause limits the number of directors that can be affiliated with the same corporate entity to two. So even though Prakash and I are Individual Members (which means we are there as representatives of the community, not as representatives of our company), my move to Dell now violated that clause.

I think this was added to the bylaws back in the days when there were a few large corporate sponsors that had large teams of people dedicated to working on OpenStack. It was a safeguard to ensure no one company could overrun the Foundation based solely on their sheer number of people involved. That’s not quite as big of an issue today, but I do still think it makes sense. It is a very good thing to make sure any group like this has a diversity of people and viewpoints.

The bylaws actually explicitly state what should happen in my situation too - Article 4.17(d) states:

If a director who is an individual becomes Affiliated during his or her term and such Affiliation violates the Director Diversity Requirement, such individual shall resign as a director.

As such, I really should have stepped down on moving to Dell.

But luckily for me, there is also a provision called out in 4.17(e):

A violation of the Director Diversity Requirement may be waived by a vote of two thirds of the Board of Directors (not including the directors who are Affiliated)

This meant that 2/3 of the Board members present, not including any of us from Dell, would have to vote in favor of allowing me to continue out my term. If less than that were in favor, then I would need to step down. And presumably there would just be an open board seat for the rest of the term until elections are held again.

There was brief discussion, but I was very happy that everyone present did vote in favor of allowing me to continue out my term. I kind of feel like I should have stepped out during this portion of the call to make sure no one felt pressured into not wanting to say no in my presence, but hopefully that wasn’t the case for anyone. It was really nice to get these votes, and some really good back-channel support from non-board attendees listening in on the call.

What can I say - compliments and positive reinforcement go far with me. :)

So I’m happy to say I will at least be able to finish out my term for the rest of 2019. I will have to see about 2020. I don’t believe Arkady nor Prakash are planning on going anywhere, so we may need to have some internal discussions about the next election. Or, probably better, leave it up to the community to decide who they would like representing them for the Individual Director seats. Prakash has been doing a lot of great work for the India community, so if it came down to it and I lost to him, I would be just fine with that.

OW2 Associate Membership

Thierry then presented a proposal to join OW2 as an Associate Member. OW2 is “an independent, global, open-source software community”. So what does that mean? Basically, like the Open Source Initiative and others, they are a group of like-minded individuals, companies, and foundations that work together to support and further open source.

We (OpenStack) have actually worked with them for some time, but we had never officially joined as an Associate Member. There is no fee to join at this level, and it is really just formalizing that we are supportive of OW2’s efforts and willing to work with them and the members to help support their goals.

They have been in support of OpenStack and open infrastructure for years, so it was great to approve this effort. We are now listed as one of their Associate Members.

Interop WG Guidelines 2019.06

Egle moved to have the board approve the 2019.06 guidelines. We had held an email vote for this approval, but since we did not get responses from every Director, we performed an in-meeting vote to record the result. All present were in favor.

The interop guidelines are a way to make sure all OpenStack deployments conform to a base set of requirements. This gives an end user of an OpenStack cloud at least some level of assurance that they can move from one cloud to another and not get a wildly different user experience. The work of the Interop Working Group has been very important to ensuring this stability and helping the ecosystem around OpenStack grow.

Miscellaneous

Prakash gave a quick update on the meetups and mini-Summits being organized in India. Sounds like a lot of really good activity happening in various regions. It’s great to see this community being supported and growing.

Alan also made a call for volunteers for the Finance and Membership committees. I had tried to get involved earlier in the year, but I think due to timing there really just wasn’t much going on at the time. With the next election coming up, and some changes in sponsors, now is actually a good time for the Membership Committee to have some more attention. I’ve joined Rob Esker to help review any new Platinum and Gold memberships. Sounds like we will have at least one new one of those coming up soon.

Summit Events

It wasn’t really an agenda topic for this board meeting, but I do think it’s worth pointing out here that the proposed changes to the structure of our yearly events have gone through, and 2020 will start to diverge from the typical pattern we have had so far of holding two major Summits per year.

Erin Disney sent out a post about these changes to the mailing list. We will have a smaller event focused on collaboration in the spring, then a larger Summit (or Summit-like) event later in the year.

With the maturity of OpenStack and where we are today, I really think this makes a lot more sense. There simply isn’t enough big new functionality and news coming out of the community today to justify two large marketing-focused events like the Summit each year. What we really need now is to foster an environment where the developers, operators, and others who are implementing new functionality and fixing bugs have the time and venue they need to work together and get things done. Having these smaller events and supporting more things like the regional Open Infrastructure Days will hopefully help keep that collaboration going and allow us to focus on the things that we need to do.

And the next event will be in beautiful Vancouver again, so that’s a plus!

by Sean McGinnis at October 17, 2019 12:00 AM

October 16, 2019

OpenStack Superuser

Zuul Community Answers Frequent Questions from AnsibleFest

The Zuul team recently attended AnsibleFest in Atlanta in September. We had the opportunity to meet loads of people who were excited to find out and learn more about Zuul. With that in mind, we compiled some of the most common questions we received to help educate the public on Zuul, open source CI, and project gating.

If you’re interested in learning more about Zuul, check out this presentation that Ansible gave about how they put Zuul open source CI to use.

Now, let’s look at the questions we heard at the Zuul booth and throughout AnsibleFest.

How does Zuul compare to…

Jenkins

Zuul is purpose-built to be a gating continuous integration and deployment system. Jenkins is a generic automation tool that can be used to perform continuous integration (CI) and continuous delivery (CD). Major differences that come out of this include:

  • Zuul expects all configuration to be managed as code
  • Zuul provides test environment provisioning via Nodepool
  • Zuul includes out of the box support for gating commits

Molecule

Molecule is a test runner for Ansible. It is used to test your Ansible playbooks. Zuul is an entire CI system that leverages Ansible to run its workloads. One of the workloads that Zuul can run for you is Molecule tests.

Tower

Zuul is designed to trigger continuous integration and deployment jobs based on events in your code review system. This means Zuul is coupled to actions like new commits showing up, reviewer approval, commits merging, git tags, and so on. Historically, Tower has primarily been used to trigger Ansible playbooks via human inputs. Recently, Tower has grown HTTP API support for GitHub and GitLab webhooks. While these new features overlap with Zuul, you will get a better CI experience through Zuul, as it has more robust support for source control events and supports gating out of the box.

Is Zuul a replacement for Jenkins?

For some users, Zuul has been a replacement for Jenkins. Others use Zuul and Jenkins together. Zuul intends to be a fully featured CI and CD system that does not need Jenkins to operate, and you do not need Jenkins to unlock any additional features in Zuul itself.

Does it work with GitLab or Bitbucket?

Since we received so many requests for GitLab support at this year’s AnsibleFest, we put a call for help out on Twitter. We are happy to say that a Zuul driver to support GitLab as a code review input is in the early stages of development. Earlier interest in Bitbucket support, including at AnsibleFest 2018, has already led to a proof-of-concept driver some folks are using, which we hope to have in a release very soon.

Can I self host my Zuul?

Absolutely. Zuul is open source software which you are free to deploy yourself.

Can I run Zuul air gapped?

To function properly Zuul needs to talk to your code review system and some sort of test compute resources. As long as Zuul can talk to those (perhaps they are similarly air gapped), then running Zuul without external connectivity should be fine.

Is there a hosted version of Zuul I can pay for?

Yes, Vexxhost has recently announced their hosted Zuul product.

Can I pay someone for Zuul support?

You can use Vexxhost’s managed Zuul service. The Zuul community is also helpful and responsive and can be reached via IRC (#zuul on Freenode) or their mailing list (zuul-discuss@lists.zuul-ci.org).

What’s the catch? What’s your business model? How do you expect to make money at this? Is this project venture-capital backed? Are you planning an IPO any time soon?

Zuul is a community-developed free/libre open-source software collaboration between contributors from a variety of organizations and backgrounds. Contributors have a personal and professional interest in seeing Zuul succeed because they, their colleagues, and their employers want to use it themselves to improve their own workflows.

There is no single company backing the project, it’s openly-governed by a diverse group of maintainers and anyone who’s interested in helping improve Zuul is welcome to join the effort. Some companies do have business models which include running Zuul as a service or selling technical support for it, and so have an incentive to assist with writing and promoting the software, but they don’t enjoy any particular position of privilege or exert special decision-making power within the project.


If you’re interested in learning more about Zuul, check out our FAQ, read through our documentation, or test it yourself on a local install.

The post Zuul Community Answers Frequent Questions from AnsibleFest appeared first on Superuser.

by Clark Boylan, Jeremy Stanley, Paul Belanger and Jimmy McArthur at October 16, 2019 03:25 PM

October 15, 2019

Galera Cluster by Codership

Planning for Disaster Recovery (DR) with Galera Cluster (EMEA and USA webinar)

We talk a lot about Galera Cluster being great for High Availability, but what about Disaster Recovery (DR)? Database outages can occur when you lose a data centre due to data center power outages or natural disaster, so why not plan appropriately in advance?

In this webinar, we will discuss the business considerations, including achieving the highest possible uptime, analysing business impact as well as risk, and focusing on disaster recovery itself. We will also walk through various scenarios, from having no offsite data to having synchronous replication to another data centre.

This webinar will cover MySQL with Galera Cluster, as well as the branches MariaDB Galera Cluster and Percona XtraDB Cluster (PXC). We will focus on architecture solutions and DR scenarios, and have you on your way to success at the end of it.

EMEA webinar: 29th of October, 1-2 PM CEST (Central European Time). JOIN THE EMEA WEBINAR

USA webinar: 29th of October, 9-10 AM PDT (Pacific Daylight Time). JOIN THE USA WEBINAR

Presenter: Colin Charles, Codership

by Sakari Keskitalo at October 15, 2019 10:10 AM

StackHPC Team Blog

Kubeflow on Baremetal OpenStack

Kubeflow logo

DISCLAIMER: No GANs were harmed in the writing of the blog.

Kubeflow is a machine learning toolkit for Kubernetes. It aims to bring popular tools and libraries under a single umbrella to allow users to:

  • Spawn Jupyter notebooks with persistent volume for exploratory work.
  • Build, deploy and manage machine learning pipelines, with initial support for the TensorFlow ecosystem that has since expanded to include other libraries which have recently gained popularity in the research community, like PyTorch.
  • Tune hyperparameters, serve models, etc.

In our ongoing effort to demonstrate that OpenStack-managed baremetal infrastructure is a suitable platform for performing cutting-edge science, we set out to deploy this popular machine learning framework on top of an underlying Kubernetes container orchestration layer deployed via OpenStack Magnum. The control plane for the baremetal OpenStack cloud consists of Kolla containers deployed using Kayobe, which provides containerised OpenStack to baremetal and is how we manage the vast majority of our deployments to customer sites. The justification for running baremetal instances is to minimise the performance overhead of virtualisation.

Apparatus

  • Baremetal OpenStack cloud (minimum Rocky), except for OpenStack Magnum, which must be at least Stein 8.1.0 for various reasons detailed later, but critically in order to support Fedora Atomic 29, which addresses a CVE present in earlier Docker versions.
  • A few spare baremetal instances (minimum 2 for 1 master and 1 worker).

Deployment Steps

  • Provision a Kubernetes cluster using OpenStack Magnum. For this step, we recommend using Terraform or Ansible. Since Ansible 2.8, the os_coe_cluster_template and os_coe_cluster modules are available to support Magnum cluster template and cluster creation. However, in our case, we opted for Terraform, which has a nicer user experience because it understands the interdependency between the cluster template and the cluster and therefore automatically determines the order in which they need to be created and updated. To be exact, we create our cluster using a Terraform template defined in this repo, where the README.md has details of how to set up Terraform, upload an image and bootstrap Ansible in order to deploy Kubeflow (a rough openstack CLI equivalent is sketched after this list). The key labels we pass to the cluster template are as follows:
cgroup_driver="cgroupfs"
ingress_controller="traefik"
tiller_enabled="true"
tiller_tag="v2.14.3"
monitoring_enabled="true"
kube_tag="v1.14.6"
cloud_provider_tag="v1.14.0"
heat_container_agent_tag="train-dev"
  • Run ./terraform init && ./terraform apply to create the cluster.
  • Once the cluster is ready, source magnum-tiller.sh to use tiller enabled by Magnum and run our Ansible playbook to deploy Kubeflow along with ingress to all the services (edit variables/example.yml to suit your OpenStack environment):
ansible-playbook k8s.yml -e @variables/example.yml
  • At this point, we should see a list of ingresses which use *-minion-0 as the ingress node by default when we run kubectl get ingress -A. We are using a nip.io based wildcard DNS service so that traffic arriving at different subdomains maps to the various services we have deployed. For example, the Kubeflow dashboard is deployed as ambassador-ingress and the Tensorboard dashboard is deployed as tensorboard-ingress. Similarly, the Grafana dashboard deployed by setting the monitoring_enabled=true label is exposed as monitoring-ingress. The mnist-ingress ingress is currently functioning as a placeholder for the next part, where we train and serve a model using the Kubeflow ML pipeline.
$ kubectl get ingress -A
NAMESPACE    NAME                  HOSTS                           ADDRESS   PORTS   AGE
kubeflow     ambassador-ingress    kubeflow.10.145.0.8.nip.io                80      35h
kubeflow     mnist-ingress         mnist.10.145.0.8.nip.io                   80      35h
kubeflow     tensorboard-ingress   tensorboard.10.145.0.8.nip.io             80      35h
monitoring   monitoring-ingress    grafana.10.145.0.8.nip.io                 80      35h
  • Finally, to train and serve the MNIST model behind mnist-ingress, clone our examples repository and deploy the kustomizations:
git clone https://github.com/stackhpc/kubeflow-examples examples -b dell
cd examples/mnist && bash deploy-kustomizations.sh
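
As referenced above, for anyone who prefers not to use Terraform or Ansible, a roughly equivalent cluster can be created directly with the openstack CLI. This is only a minimal sketch: the image, flavor, external network and network driver names below (fedora-atomic-29, baremetal-flavor, public, calico) are placeholders for your own environment, while the labels mirror the ones listed earlier.

# Create a cluster template carrying the labels described above.
openstack coe cluster template create kubeflow-template \
  --coe kubernetes \
  --image fedora-atomic-29 \
  --external-network public \
  --master-flavor baremetal-flavor \
  --flavor baremetal-flavor \
  --network-driver calico \
  --labels cgroup_driver=cgroupfs,ingress_controller=traefik,tiller_enabled=true,tiller_tag=v2.14.3,monitoring_enabled=true,kube_tag=v1.14.6,cloud_provider_tag=v1.14.0,heat_container_agent_tag=train-dev

# Minimum apparatus: one master and one worker, both baremetal instances.
openstack coe cluster create kubeflow \
  --cluster-template kubeflow-template \
  --master-count 1 \
  --node-count 1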

Notes on Monitoring

Kubeflow comes with a Tensorboard service which allows users to visualise machine learning model training logs, model architecture and also the efficacy of the model itself by reducing the latent space of the weights in the final layer before the model makes a classification.

The extensibility of the OpenStack Monasca service also lends itself well to integration into machine learning model training loops, provided that the agent is configured to accept non-local traffic on workers. This can be done by setting the following values inside /etc/monasca/agent/agent.yaml and restarting the monasca-agent.target service:

monasca_statsd_port: 8125
non_local_traffic: true
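
A minimal sketch of applying that change on a worker, assuming a systemd-based host where the agent provides the monasca-agent.target unit mentioned above:

# Restart the Monasca agent so it picks up the statsd settings above.
sudo systemctl restart monasca-agent.target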

On the client side, where the machine learning model example is running, metrics of interest can now be posted to the Monasca agent. For example, we can provide a callback function to FastAI, a machine learning wrapper library which uses PyTorch primitives underneath with an emphasis on transfer learning (and which can be launched as a GPU-flavored notebook container on Kubeflow) for tasks such as image and natural language processing. The training loop of the library hooks into the callback functions encapsulated within the PostMetrics class defined below at the end of every batch and at the end of every epoch of the model training process:

# Import the module.
from fastai.callbacks.loss_metrics import *
import monascastatsd as mstatsd

conn = mstatsd.Connection(host='openhpc-login-0', port=8125)

# Create the client with optional dimensions
client = mstatsd.Client(connection=conn, dimensions={'env': 'fastai'})

# Create a gauge called fastai
gauge = client.get_gauge('fastai', dimensions={'env': 'fastai'})

class PostMetrics(LearnerCallback):

    def __init__(self):
        self.stop = False

    def on_batch_end(self, last_loss, **kwargs:Any)->None:
        if self.stop: return True #to skip validation after stopping during training
        # Send the current training loss (sample_rate=1.0 posts it on every batch).
        gauge.send('trn_loss', float(last_loss), sample_rate=1.0)

    def on_epoch_end(self, last_loss, epoch, smooth_loss, last_metrics, **kwargs:Any):
        val_loss, error_rate = last_metrics
        gauge.send('val_loss', float(val_loss), sample_rate=1.0)
        gauge.send('error_rate', float(error_rate), sample_rate=1.0)
        gauge.send('smooth_loss', float(smooth_loss), sample_rate=1.0)
        gauge.send('trn_loss', float(last_loss), sample_rate=1.0)
        gauge.send('epoch', int(epoch), sample_rate=1.0)

# Pass the PostMetrics() callback to cnn_learner's training loop
learn = cnn_learner(data, models.resnet34, metrics=error_rate, bn_final=True, callbacks=[PostMetrics()])

These metrics are sent to the OpenStack Monasca API and can then be visualised on a Grafana dashboard alongside GPU power consumption, allowing a user to assess the tradeoff against model accuracy, as shown in the following figure:

(Figure: Grafana dashboard plotting model training metrics alongside GPU power consumption)

In addition, general resource usage monitoring may also be of interest. There are two Prometheus based monitoring options available on Magnum:

  • First, the non-helm based method uses the prometheus_monitoring label, which when set to true deploys a monitoring stack consisting of a Prometheus service, a Grafana service and a DaemonSet (Kubernetes terminology for a pod that runs on every node in the cluster) of node exporters. However, the deployed Grafana service does not provide any useful dashboards acting as an interface to the collected metrics, due to a change in how default dashboards are loaded in recent versions of Grafana. A dashboard can be installed manually, but it does not allow the user to drill down into the visible metrics further and presents the information in a flat way.
  • Second, the helm based method (recommended) requires the monitoring_enabled and tiller_enabled labels to be set to true. It deploys a similar monitoring stack as above, but because it is helm based, it is also upgradable. In this case, the Grafana service comes preloaded with several dashboards that present the metrics collected by the node exporters in a meaningful way, allowing users to drill down to various levels of detail and types of groupings, e.g. by cluster, namespace, pod, node, etc.
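
Assuming the helm based method was chosen (monitoring_enabled and tiller_enabled set to true, as in the labels shown earlier), a quick sanity check is to look at what landed in the monitoring namespace seen in the ingress listing above; exact pod and service names may differ between Magnum releases:

# The Prometheus, Grafana and node-exporter pods should appear here.
kubectl get pods -n monitoring

# The Grafana dashboard is reachable via the monitoring-ingress shown earlier.
kubectl get ingress -n monitoring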

Of course, it is also possible to deploy a Prometheus based monitoring stack without having it managed by Magnum. Additionally, we have demonstrated that it is also an option to run the Monasca agent inside a container to post metrics to the Monasca API, which may already be available if it is configured as the way to monitor control plane metrics.

Why we recommend upgrading Magnum to Stein (8.1.0 release)

  • OpenStack Magnum (Rocky) supports up to Fedora Atomic 27, which is EOL. Support for Fedora Atomic 29 (with the fixes for the CVE mentioned earlier) requires a backport of various fixes from the master branch that reinstate support for the two network plugin types supported by Magnum (namely Calico and Flannel).
  • Additionally, there have been changes to the Kubernetes API which are outside of the Magnum project's control. Rocky only supports versions of Kubernetes up to v1.13.x, and the Kubernetes project maintainers only actively maintain a development branch and 3 stable releases. The current development release is v1.17.x, which means v1.16.x, v1.15.x and v1.14.x can expect updates and backports of critical fixes. Support for v1.15.x and v1.16.x is coming in the Train release, but upgrading to Stein will enable us to support up to v1.14.x.
  • The traefik ingress controller deployed by Magnum no longer works in the Rocky release because the former behaviour was to always deploy the latest tag, and a new major version (2.0.0) has since been released with breaking changes to the API, which inevitably fails. Stein 8.1.0 has the necessary fixes and additionally supports the more popular nginx based ingress controller.

Get in touch

If you would like to get in touch we would love to hear from you. Reach out to us via Twitter or directly via our contact page.

by Bharat Kunwar at October 15, 2019 10:00 AM

October 10, 2019

Aptira

10% off + FREE Consulting – FINAL DAYS!

Aptira 10 year birthday 10% off sale

Final days to claim our Birthday Special!

In case you missed it, on the 9th of the 9th 2019, we turned 10! So until the 10th of the 10th, we’re offering 10% off all our services. That’s 10% off managed services, 10% off training, 10% off everything except hardware. This 10% discount also applies to pre-paid services, so you can pre-pay for the next 12 months to really maximise your savings!

We’re also offering a free 2 hour consulting session to help get you started with transforming your Cloud solution.

This offer is ending soon, so chat with a Solutionaut today to take advantage of this once in a decade discount and let us turn your business capabilities into a competitive advantage.

Let us make your job easier.
Find out how Aptira's managed services can work for you.

Find Out Here

The post 10% off + FREE Consulting – FINAL DAYS! appeared first on Aptira.

by Jessica Field at October 10, 2019 12:59 PM

OpenStack Superuser

From Containers to Edge Computing, Research Organizations Rely on Open Infrastructure

A lot separates Milan and Rome—including three hours by train—but one thing that connects these two cities is the open infrastructure community. 

The Italian community organizers—Binario Etico and Irideos—made two big changes to the local event this year. First, they renamed the OpenStack Day to OpenInfra Days to broaden the scope of the content at the event. They also planned two events this year in order to put the latest trends and user stories in front of as many local community members as possible. The events would not have been possible without the support of the event sponsors: D2iQ, GCI, Linux Professional Institute, OpenStack Foundation, and Mellanox.

A combined crowd of over 300 attendees gathered in Milan and Rome last week at the OpenInfra Days Italy to hear how organizations are building and operating open infrastructure. 

Mariano Cunietti and Davide Lamanna kicked off both events explaining how important it is for European organizations to embrace open source components and cross community collaboration.

“It’s the way we collaborate and the way we shape communication flow that works,” Cunietti said. “Collaborative open source is a way to shift technicians from being consumers to participants and citizens of the community. This is a very important shift.” 

From a regional perspective, Lamanna explained how European standards and privacy laws create requirements that have given local, open source organizations a competitive advantage around interoperability and flexibility features.

To exemplify the power of open infrastructure and community collaboration in Europe, several users shared their production stories. An industry that is very pervasive in Europe—particularly Italy—is research. 

  • GARR: Saying that no infrastructure is open until you open it, GARR harmonizes and implements infrastructure for the benefit of the scientific community in Italy, amounting to around 4 million users. Alex Barchiesi shared some stats around GARR’s current OpenStack deployment (8,500 cores with 10 PB of storage in five data centers across three regions) as well as their approach to identity federation. GARR’s concept of federation: the simpler, the better; the fewer requirements, the more inclusive. With their multi-region, multi-domain model, Barchiesi explained how they have architected a shared identity service. To give back to the community, the GARR team contributes upstream to OpenStack Horizon, k8s-keystone-auth, and juju charms.
  • The Istituto Nazionale di Fisica Nucleare (INFN)—an Italian public research institute for high energy physics (who also collaborates with CERN!)—has a private cloud infrastructure that is OpenStack-based and geographically distributed in three major INFN data centers in Italy. The adoption of Ceph as distributed object storage solution enables INFN to provide both local block storage in each of the interested sites and a ready-to-use disaster recovery solution implemented among the same sites. Collectively, the main data centers have around 50,000 CPU cores, 50 PB of enterprise-level disk space, and 60 PB of tape storage.  
  • While CERN is not based in Italy, their OpenStack and Kubernetes use case provides learnings around the world. Jan van Eldik shared updated stats around CERN’s open infrastructure environment with focuses on OpenStack Magnum, Ironic and Kubernetes. CERN by the numbers: more than 300,000 OpenStack cores, 500 Kubernetes clusters, and 3,300 servers managed by OpenStack Ironic (expected to be 15,000 in the next year). 

Outside of the research sector, other users who shared their open infrastructure story include the city government of Rome’s OpenStack use case, Sky Italia’s creation of a Kubernetes blueprint and network setup that empowers their brand new Sky Q Fibra service, and the SmartME project that is deploying OpenStack at the edge for smart city projects in four cities across Italy. 

What’s next for the open infrastructure community in Italy? Stay tuned on the OpenStack community events page for deadlines and event dates. 

Can’t wait until 2020? Join the global open infrastructure community at the Open Infrastructure Summit Shanghai from November 4-6.

Cover photo courtesy of Frederico Minzoni.

The post From Containers to Edge Computing, Research Organizations Rely on Open Infrastructure appeared first on Superuser.

by Allison Price at October 10, 2019 12:00 PM

October 09, 2019

Mirantis

SUSE OpenStack is no more — but Don’t Panic

SUSE has announced they're discontinuing their OpenStack distro, but it's not the end of the line for their customers.

by Nick Chase at October 09, 2019 08:27 PM

Aptira

Open Source Networking Days Australia

Open Source Networking Days Australia

Coming Soon: Open Source Networking Days Australia

Open Source Networking Day Australia is a one-day mini-summit hosted by Telstra and co-organized by LF Networking (LFN) and Aptira.

This is the first time that LFN has brought an open source networking event to Australia and it will be a unique opportunity to connect and collaborate with like-minded community members that are passionate about open source networking. The event will bring together service providers, the developer community, industry partners and academia for a day of collaboration and idea exchange on all things related to open-source networking, including LF Networking (LFN) projects like ONAP, OpenDaylight, Tungsten Fabric and Open Networking Foundation (ONF) projects like COMAC, Stratum, ONOS and P4, as well as home-grown innovation such as OpenKilda and many more.

To make open source networking viable in Australia, we need to collectively grow awareness, skills and investment. Attendees will learn about the state of open source networking adoption globally and locally, how open source is applied in network automation, the evolution of software defined networking, and how open source enables exciting use cases in edge computing. They will also have plenty of opportunities to interact with global experts, industry peers and developers via keynote sessions, panel Q&As, technical deep-dives and business discussions, and, more importantly, learn how to get involved in open source networking communities going forward. Registration is free, so register today and we hope to see you in Melbourne!

Melbourne, Australia | November 11, 2019
8:30 am – 5:00 pm
Telstra Customer Insight Centre (CIC)
Tickets

In addition to this, there will also be a Next-Gen SDN Tutorial hosted on the 12th of November.

Next-Gen SDN is delivering fine-grained network programmability with zero touch configuration and management, enabling operators’ complete control of their networks. Leveraging P4, P4Runtime, OpenConfig/gNMI and gNOI, NG-SDN is now truly delivering on the ‘software defined’ promise of SDN for future transformation, new applications and unprecedented levels of new value creation.

This tutorial is an opportunity for architects and engineers to learn the basics and to practically experiment with some of the building blocks of the NG-SDN architecture, such as:

  • P4 language
  • Stratum (P4Runtime, OpenConfig over gNMI, gNOI)
  • ONOS

The goal of the tutorial is to answer questions such as:

  • What is P4 and how do I use it?
  • How do I go from a P4 program to a complete network solution?
  • What is Stratum and how can I use its interfaces to control packet forwarding, configure ports, or push software upgrades to my network devices?
  • How can I use ONOS to write control-plane apps for my P4 program?

It is organized around a sequence of introductory presentations, as well as hands-on exercises that show how to build a leaf-spine data center fabric from scratch based on IPv6 using P4 and ONOS.

The tutorial will include an introduction to the P4 language, Stratum, and ONOS. Participants will be provided with starter P4 code and an ONOS app implementation, along with instructions to run a Mininet-emulated leaf-spine topology of Stratum-enabled software switches. Only basic programming and networking knowledge is required to complete the hands-on exercises. Knowledge of Java and Python will be helpful to understand some of the starter code.

Registrations for the tutorial are limited to 50 people, so to secure your place register now.

Open Source Networking Days Australia Sponsors

Ready to move your network into the software defined future?
Automate your network with ONAP.

Find Out How

The post Open Source Networking Days Australia appeared first on Aptira.

by Jessica Field at October 09, 2019 12:12 PM

October 08, 2019

OpenStack Superuser

Open Infrastructure in Germany: Hitting the Road with New and Growing OpenStack Use Cases

A year after we held the OpenStack Summit Berlin, it was great to return to the city to see what has changed: hearing how OpenStack users had grown their deployments since we last saw them, finding new users sharing their stories, and learning how companies are integrating open infrastructure projects in innovative ways.

Europe’s first music hotel with photos on the wall of the musicians who have visited in years past welcomed a new audience: 300 Stackers for the 2019 OpenStack Day DOST in Berlin. Community members gathered for two days of breakout sessions, sponsor demos, and waterfront views. Sessions and an evening event cruise along the Spree River were made possible by event organizers and sponsors: B1 Systems, Canonical, Netways Web Services, Noris Network, the OpenStack Foundation, Open Telekom Cloud, Rancher, and SUSE.

In addition to being home to a diverse set of ecosystem vendors, German roads are also home to automakers who rely on OpenStack including Audi and BMW who shared their use cases with conference attendees.

BMW first shared its OpenStack story at the Paris Summit in 2014 and has continued to grow its OpenStack footprint rapidly ever since. Currently sitting at 700 servers, they expect their environment to grow by an additional 300 by the end of the year. As of today, almost 400 projects and platforms (rising steadily), including autonomous driving, rely on the dynamic, flexible and tailor-made instance of OpenStack at the BMW Group.

Andreas Poëschl showing how BMW has grown its OpenStack environment over the years.

Audi was the second automaker of the conference to share its open infrastructure use case, powered by OpenStack and Ceph. Audi AG’s shop floor IT environment is designed for uninterrupted, highly available 24/7 operation, and these requirements make it difficult to test new, not yet evaluated technologies close to production. To quickly bring these technologies into production and make them available, the Audi Production Lab was founded. There, it is possible to incorporate the latest concepts and develop them to the point where they meet the requirements of production.

Through the construction of a self-sufficient, decoupled, independently usable, flexible, and adaptable server infrastructure based on Ceph and OpenStack in the Production Lab, it is now possible to evaluate innovative technologies such as Kubernetes and bring them to production in a timely manner.

Auto makers were not the only ones sharing their open infrastructure integration story.

  • SAP shared its Converged Cloud where the basis is OpenStack orchestrated in a Kubernetes cluster. With the newly developed Kubernikus module, the Converged Cloud enables SAP to offer its customers Kubernetes-as-a-Service, which is provided as a one-button self-service. Kubernikus creates a Kubernetes cluster that operates as a managed service and can be offered for API support. Kubernikus works with the OpenStack API and remains 100% Kubernetes and Open Source. The structure allows the separate operation of Kubernetes API and project-specific nodes.  
  • The Open Telekom Cloud, the public cloud service of Deutsche Telekom, is one of the local members of the OpenStack 100k core club. With over a quarter of a million managed CPU cores, it’s one of the largest fully managed clouds in Europe. Their team presented the DevOps model that enables their OpenStack-powered public cloud to continue to grow.

What’s next for the open infrastructure community in Germany? The event organizers say the planning for the 2020 event in Hamburg is underway. Stay tuned on the OpenStack community events page for deadlines and event dates. 

Can’t wait until 2020? Join the global open infrastructure community at the Open Infrastructure Summit Shanghai November 4-6. 

Cover photo courtesy of NETWAYS Web Services.

The post Open Infrastructure in Germany: Hitting the Road with New and Growing OpenStack Use Cases appeared first on Superuser.

by Allison Price at October 08, 2019 01:00 PM

October 07, 2019

Mirantis

How to deploy Airship in a Bottle: A quick and dirty guide

Airship in a Bottle is a simple way to create an Airship deployment that includes a compact OpenStack cluster.

by Nick Chase at October 07, 2019 12:55 PM

October 05, 2019

Aptira

Real-World Open Networking. Part 5: Dissonance between Networks and Software Domains

In our last post we finished up a detailed examination of different aspects of interoperability. In this post, we will analyse the different mindsets of traditional networking domains and software development domains, and explain why there is often a built-in dissonance between them.

Background

Whilst Open Network solutions require the integration of network and software components and practices, at the current time (and historically) these two domains are largely incompatible. Unless carefully managed this incompatibility will cause project stress and impairment.

Given that many (if not most) Open Network solutions originate in the Network Engineering department within a user organisation, this is an important consideration for the entire lifecycle of the solution; especially so if the Network Engineering team does not have established software skills and experience.

The Problem

Dissonance can show up in many ways in an Open Networking project because of these different paradigms and mindsets. Below we cover the top four aspects of the problem:

  • Design & Production paradigm conflicts
  • Ability to Iterate
  • End user Engagement
  • Expectations of Interoperability

Expectation on Development

We described in Software Interlude Part 6 – Development Paradigms that traditional network engineering aligns more with the production model of work, i.e. that the design and production processes are largely serialised and separate.

Software development on the other hand operates on a different paradigm, in which design and production are largely intermingled: not just parallel but intertwined within the same team and the same resources.

Networks (in general) are designed using discrete components and can be designed and built along fairly pre-determined and predictable steps guided by engineering principles. Networks are highly mechanical and mathematical in nature, following a well-established set of rules. Even the software components of traditional network equipment (configuration) followed rules backed up by years of mathematical research. Network designs can be validated in advance using the same techniques.

Practically, we see the implications of this in the way network projects are executed. Formally, network projects follow far more of a plan-based (aka Waterfall) lifecycle model. There are many logical reasons why the plan-based approach is better for this type of project.

Informally, we also see this: it’s typical that a senior, more experienced, person will do the network design and create a specification for how the network is to be built. This network design is typically handed off to other technical personnel for the build.

Expectations on the ability to iterate

Flexibility is a key aspect of software development projects: it underpins everything that a software developer does and thinks. Networks appear to value other things: integrity, security etc. The difference comes down to the relative size of Increments, prototypes and/or MVP’s. Note: the MVP (Minimum Viable Product) is the smallest component that can be deployed to production and which enables at least 1 valuable use case.

Small increments in functionality, prototypes and MVPs are important parts of the solution development process. These all support the agile principles of inspect and adapt.

For software, these increments can be very small and be produced very rapidly. Traditionally, in the network domain, creating a small instance of some aspect of a solution has a much higher hurdle. Model labs or test environments may exist, but they are typically insufficient for the dynamic changes required by the need to iterate, if they are available at all and have the right hardware in sufficient quantities.

Expectations on End User Engagement

It is not uncommon for network projects to be built to very general requirements and not to specific end-user use cases. The logical flow-on from this is that end-users are not actively engaged in the development lifecycle.

Software projects, and in particular Agile software projects, are built on engagement with end-users: the expectation is that end-users will interact with developers on a daily basis. This requires certain skillsets that are well-developed in software engineers (well, to varying degrees), but few Network engineers have this experience.

Expectations of Interoperability

In general, network developers have a much higher expectation of out-of-the-box interoperability than software developers, notwithstanding the softwareisation of networks.

Experienced software developers typically have a high level of scepticism when it comes to claims of interoperability and will naturally plan in a validation process to ensure they understand how the product will actually work. Network engineers and architects appear to be more ready to accept claims of interoperability or standards compliance and don’t necessarily prepare for validation processes, except for the first-time onboarding of equipment into a network.

But given the different natures of the products, an initial validation for a software product can have a relatively short life (as new updates can break this tested functionality), whereas initial validation of a hardware product has a much longer life.

Conclusion

The existence of these sources of dissonance, and more, can easily lead to project impairment if not anticipated and managed carefully.

In both project planning and execution, problems arise when one party wants to invest time into something (e.g. risk reserves or validation testing) that the other party doesn’t see the need for, and consequently believes is unjustified padding of the estimates, or simply doesn’t understand, leading to misunderstanding and miscommunication.

How do we manage this effectively? We treat everything as a software project.

Let us make your job easier.
Find out how Aptira's managed services can work for you.

Find Out Here

The post Real-World Open Networking. Part 5: Dissonance between Networks and Software Domains appeared first on Aptira.

by Adam Russell at October 05, 2019 01:20 PM

OpenStack Superuser

OpenStack Ironic Bare Metal Program case study: VEXXHOST

The OpenStack Foundation announced in April 2019 that its Ironic software is powering millions of cores of compute all over the world, turning bare metal into automated infrastructure ready for today’s mix of virtualized and containerized workloads.

Some 30 organizations joined for the initial launch of the OpenStack Ironic Bare Metal Program, and Superuser is running a series of case studies to explore how people are using it.

VEXXHOST provides high performance, cloud computing solutions that are cost conscious, complete, and widely flexible. In 2011, VEXXHOST adopted OpenStack software for its infrastructure. Since then, VEXXHOST has been an active contributor and an avid user of OpenStack. Currently, VEXXHOST provides infrastructure-as-a-service OpenStack public cloud, private cloud, and hybrid cloud solutions to customers, from small businesses to enterprises across the world.

Why did you select OpenStack Ironic for your bare metal provisioning in your product?

VEXXHOST has a long history of involvement with OpenStack technology, dating back to the Bexar release. We have since been powering all of our infrastructures using OpenStack. Taking advantage of Ironic for our bare metal provisioning seemed a natural next step in the continuous building out of our system and Ironic fit right in with each of our components, integrating easily with all of our existing OpenStack services.

As we offer multiple architectures, enterprise-grade GPUs, and various hardware options, the actual process of testing software deployments can pose a real challenge when it comes to speed and efficiency. However, we knew that choosing Ironic would resolve these difficulties, with the benefits being passed on to our users, in addition to enabling us to provide them with the option of deploying their private cloud on high-performing bare metal.

What was your solution before implementing Ironic?

Before VEXXHOST implemented OpenStack Ironic, we were using a system that we had built internally. For the most part, this system provided an offering of services that Ironic was already delivering, so it made sense to adopt Ironic as opposed to maintaining our smaller version.

What benefits does Ironic provide your users?

Through Ironic, VEXXHOST’s users have access to fully dedicated and secure physical machines that can live in our data centres or theirs. Due to its physical and dedicated nature, the security provided by bare metal relieves VEXXHOST’s users of any risks associated with environment neighbours, and thanks to the isolation factor, users are ensured that their data is never exposed to others. Ironic can also act as an automation tool for the centralized housing and management of all their machines, and it even enables our users to access certain features that aren’t available in virtual machines, like having multiple levels of virtual machines.

Additionally, VEXXHOST’s users benefit from Ironic’s notably simpler configuration and less complex set-up when compared to virtual machines. Where use cases require it, Ironic can also deliver to our users a higher level of performance than virtual machines. Through the region controller, our users benefit from high availability starting at the data center level and users are able to create and assign physical availability zones to better control critical availability areas. Through the use of Ironic, VEXXHOST can easily run any other OpenStack projects and configure our user’s bare metal specifically for their use cases. Ironic is also easily scaled from a few servers to multiple racks within a data centre and through their distributed gateways, makes it possible to process large parallel deployments. By using OpenStack technology, like Ironic, VEXXHOST ensures that users are never faced with the risks associated with vendor lock-in.

What feedback do you have to provide to the upstream OpenStack Ironic team?

Through our long-standing involvement with the OpenStack Community, based on VEXXHOST’s contributions and our CEO Mohammed Naser‘s role as Ansible PTL and member of the Technical Committee, we regularly connect with the Ironic team and have access to their conversations. Currently, there isn’t any feedback that we haven’t already shared with them.

Learn more

You’ll find an overview of Ironic on the project Wiki.
Discussion of the project takes place in #openstack-ironic on irc.freenode.net. This is a great place to jump in and start your ironic adventure. The channel is very welcoming to new users – no question is a wrong question!

The team also holds one-hour weekly meetings at 1500 UTC on Mondays in the #openstack-ironic room on irc.freenode.net, chaired by Julia Kreger (TheJulia) or Dmitry Tantsur (dtantsur).

Stay tuned for more case studies from organizations using Ironic.

Photo // CC BY NC

The post OpenStack Ironic Bare Metal Program case study: VEXXHOST appeared first on Superuser.

by Superuser at October 05, 2019 01:00 PM

October 04, 2019

Chris Dent

Fix Your Debt: Placement Performance Summary

There's a thread on the openstack-discuss mailing list, started in September and then continuing in October, about limiting planned scope for Nova in the Ussuri cycle so that stakeholders' expectations are properly managed. Although Nova gets a vast amount done per cycle there is always some stuff left undone and some people surprised by that. In the midst of the thread, Kashyap points out:

I welcome scope reduction, focusing on fewer features, stability, and bug fixes than "more gadgetries and gongs". Which also means: less frenzy, less split attention, fewer mistakes, more retained concentration, and more serenity. [...] If we end up with bags of "spare time", there's loads of tech-debt items, performance (it's a feature, let's recall) issues, and meaningful clean-ups waiting to be tackled.

Yes, there are.

When Placement was extracted from Nova, one of the agreements the new project team made was to pay greater attention to tech-debt items, performance, and meaningful clean-ups. One of the reasons this was possible was that by being extracted, Placement vastly limited its scope and feature drive. Focused attention is easier and the system is contained enough that unintended consequences from changes are less frequent.

Another reason was that for several months my employer allowed me to devote effectively 100% of my time to upstream work. That meant that there was long term continuity of attention in my work. Minimal feature work combined with maximal attention leads to some good results.

In August I wrote up an analysis of some of that work in Placement Performance Analysis, explaining some of the things that were learned and changed. However that analysis was comparing Placement code from the start of Train to Train in August. I've since repeated some of the measurement, comparing:

  1. Running Placement from the Nova codebase, using the stable/stein branch.
  2. Running Placement from the Placement codebase, using the stable/stein/ branch.
  3. Running Placement from the Placement codebase, using master, which at the moment is the same as what will become stable/train and be released as 2.0.0.

The same database (PostgreSQL) and web server (uwsgi using four processes of ten threads each) is used with each version of the code. The database is pre-populated with 7000 resource providers representing a suite of 1000 compute hosts with a moderately complex nested provider topology that is similar to what might be used for a virtualized network function.

The same query is used, whatever the latest microversion is for that version:

http://ds1:8000/allocation_candidates? \
                resources=DISK_GB:10& \
                required=COMPUTE_VOLUME_MULTI_ATTACH& \
                resources1=VCPU:1,MEMORY_MB:256& \
                required1=CUSTOM_FOO& \
                resources2=FPGA:1& \
                group_policy=none

(This is similar to what is used in the nested-perfload performance job in the testing gate, modified to work with all available microversions.)
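
For anyone who wants to reproduce a rough version of these numbers, here is a minimal Python sketch that replays the same allocation_candidates query serially and then ten at a time and reports the mean time per request. It assumes the Placement endpoint above is reachable and running without real authentication (as in a perfload-style setup), so the token value and the "latest" microversion are placeholders to adjust for your own deployment.

import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = ("http://ds1:8000/allocation_candidates"
       "?resources=DISK_GB:10"
       "&required=COMPUTE_VOLUME_MULTI_ATTACH"
       "&resources1=VCPU:1,MEMORY_MB:256"
       "&required1=CUSTOM_FOO"
       "&resources2=FPGA:1"
       "&group_policy=none")
HEADERS = {
    "x-auth-token": "admin",                      # placeholder token for a noauth setup
    "openstack-api-version": "placement latest",  # ask for the newest microversion
}


def one_request(_):
    # Time a single GET of the allocation candidates query.
    start = time.monotonic()
    resp = requests.get(URL, headers=HEADERS)
    resp.raise_for_status()
    return time.monotonic() - start


def run(total, concurrency):
    # Issue `total` requests with `concurrency` of them in flight at once.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        timings = list(pool.map(one_request, range(total)))
    print("%d requests, %d at a time: %.3fs mean per request"
          % (total, concurrency, sum(timings) / len(timings)))


if __name__ == "__main__":
    run(10, 1)     # 10 serial requests
    run(100, 10)   # 100 requests, 10 at a time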

Here are some results, with some discussion after.

10 Serial Requests

Placement in Nova (stein)

Requests per second:    0.06 [#/sec] (mean)
Time per request:       16918.522 [ms] (mean)

Extracted Placement (stein)

Requests per second:    0.34 [#/sec] (mean)
Time per request:       2956.959 [ms] (mean)

Extracted Placement (train)

Requests per second:    1.37 [#/sec] (mean)
Time per request:       730.566 [ms] (mean)

100 Requests, 10 at a time

Placement in Nova (stein)

This one failed. The numbers say:

Requests per second:    0.18 [#/sec] (mean)
Time per request:       56567.575 [ms] (mean)

But of the 100 requests, 76 failed.

Extracted Placement (stein)

Requests per second:    0.41 [#/sec] (mean)
Time per request:       24620.759 [ms] (mean)

Extracted Placement (train)

Requests per second:    2.65 [#/sec] (mean)
Time per request:       3774.854 [ms] (mean)

The improvements between the versions in Stein (16.9s to 2.9s per request) were mostly made through fairly obvious architecture and code improvements found by inspection (or simply knowing it was not ideal when first made, and finally getting around to fixing it). Things like removing the use of oslo versioned objects and changes to cache management to avoid redundant locks.

From Stein to Train (2.9s to .7s per request) the improvements were made by doing detailed profiling and benchmarking and pursuing a very active process of iteration (some of which is described by Placement Performance Analysis).

In both cases this was possible because people (especially me) had the "retained concentration" desired above by Kashyap. As a community OpenStack needs to figure out how it can enshrine and protect that attention and the associated experimentation and consideration for long term health. I was able to do it in part because I was able to get my employer to let me and in part because I overcommitted myself.

Neither of these things are true any more. My employer has called me inside, my upstream time will henceforth drop to "not much". I'm optimistic that we've established a precedent and culture for doing the right things in Placement, but it will be a challenge and I don't think it is there in general for the whole community.

I've written about some of these things before. If the companies making money off OpenStack are primarily focused on features (and being disappointed when they can't get those features into Nova) who will be focused on tech-debt, performance, and meaningful clean-ups? Who will be aware of the systems well enough to effectively and efficiently review all these proposed features? Who will clear up tech-debt enough that the systems are easier to extend without unintended consequences or risks?

Let's hit that Placement performance improvement some more, just to make it clear:

In the tests above, "Placement in Nova (stein)" failed with a concurrency of 10. I wanted to see at what concurrency "Extracted Placement (train)" would fail: at a concurrency of 150 (with 1000 total requests) some requests fail, while at 140 all requests work, albeit slowly per request (33s). Based on the error messages seen, the failure at 150 is tied to the sizing and configuration of the web server and has nothing to do with the placement code itself. The way to get higher concurrency is to have more or larger web servers.

Remember that the nova version fails at concurrency of 10 with the exact same web server setup. Find the time to fix your debt. It will be worth it.

by Chris Dent at October 04, 2019 01:32 PM

OpenStack Superuser

OpenStack Ironic Bare Metal Program case study: China Mobile

The OpenStack Foundation announced in April 2019 that its Ironic software is powering millions of cores of compute all over the world, turning bare metal into automated infrastructure ready for today’s mix of virtualized and containerized workloads.

Over 30 organizations joined for the initial launch of the OpenStack Ironic Bare Metal Program, and Superuser is running a series of case studies to explore how people are using it.

China Mobile is a leading telecommunications services provider in mainland China. The Group provides full communications services in all 31 provinces, autonomous regions and directly-administered municipalities throughout Mainland China and in Hong Kong Special Administrative Region.

In 2018, the company was again selected as one of “The World’s 2,000 Biggest Public Companies” by Forbes magazine and Fortune Global 500 (100) by Fortune magazine, and recognized for three consecutive years in the global carbon disclosure project CDP’s 2018 Climate A List as the first and only company from Mainland China.

Why did you select OpenStack Ironic for bare metal provisioning in your product?

China Mobile has a large number of businesses running on various types of architectures, such as x86 and Power servers, which provide high quality services to our business and customers. This number continues to increase by more than 100,000 every year. Recently we have built several cloud solutions based on OpenStack, as Gold Members of the OpenStack Foundation, so our public cloud and private cloud solutions are compatible with OpenStack. Ironic manages the compute, storage and network resources matched with OpenStack, which is the core requirement of China Mobile’s bare metal cloud.

In addition, China Mobile’s physical IaaS solution involves multiple types of vendor hardware and solutions. Thanks to OpenStack Ironic’s improved architecture design and its rich set of contributed plug-ins, we were able to draw on reliable experience from the community while building our service.

What was your solution before implementing Ironic?

Before adopting Ironic, the best automation method we had was PXE + ISO + kickstart to meet the relevant requirements. Due to its limitations around networking, storage and even operating system compatibility, we had to drive all of the processes involved manually. At the same time, due to the lack of relevant service data at the management level, workflow data could not be recorded well or handed over in the course of the work, which greatly reduced delivery efficiency.

What benefits does Ironic provide your users?

The biggest benefit of Ironic for us and our users is that it increases the efficiency of server delivery. What originally took a day or even weeks now takes half an hour to an hour. Based on Ironic, users can choose from more operating systems, even on arm-linux. The network resources they want, such as Virtual Private Cloud (VPC), Load Balancer (LB) and Firewall (FW), can be freely configured through the combination of Ironic and Neutron. Likewise, the combination of Ironic and Cinder can provide users with Boot From Volume (BFV) and other disk array management and configuration capabilities. In short, through Ironic we can deliver a complete compute, network and storage server through a top-down process, without operations and maintenance staff or users being responsible for information synchronization and manual configuration.

With Ironic, we built a platform for data center administrators to redefine the access standards for the different hardware they manage. With this in place, all hardware vendors must comply with the management and data transmission protocols or contribute their plug-ins to OpenStack. Administrators can then focus on management and serving users.

For China Mobile, hardware management or server OS delivery is sometimes not enough. We are extending our bare metal cloud to support applications, integrated through OpenStack Mistral and Ansible. All in all, we are continuously improving the ecosystem around Ironic to save our users time.

What feedback do you have to provide to the upstream OpenStack Ironic team?

We hope that Ironic has an operating system agent solution, like qemu guest agent.

Learn more

You’ll find an overview of Ironic on the project Wiki. Discussion of the project takes place in #openstack-ironic on irc.freenode.net. This is a great place to jump in and start your Ironic adventure. The channel is very welcoming to new users – no question is a wrong question!

The team also holds one-hour weekly meetings at 1500 UTC on Mondays in the #openstack-ironic room on irc.freenode.net chaired by Julia Kreger (TheJulia) or Dmitry Tantsur (dtantsur).

Stay tuned for more case studies from organizations using OpenStack Ironic.

 

Photo // CC BY NC

The post OpenStack Ironic Bare Metal Program case study: China Mobile appeared first on Superuser.

by Superuser at October 04, 2019 01:00 PM

Aptira

Real-World Open Networking. Part 4 – Interoperability: Problems with API’s

In our last post we looked at different general patterns of standards compliance in Open Network solutions. In this post we drill down another layer to look at interoperability at the Application Program Interface (API) level, which creates issues at a level beyond standards.

Background

As we’ve mentioned previously, network equipment has been focused on interface compatibility and interoperability for many decades and has a history of real interoperability success. Traditional networks exposed communications interfaces and most of the standards for network equipment focus on these interfaces.

But with the advent of network software equivalents to hardware devices, we open up new areas for problems.

Software components may implement the same types of communications interfaces, but they will also provide Application Program Interfaces (API’s) for interaction between themselves and other software components. These API’s may be the subject of standards, and thus the issues raised in the previous article may apply. Or they may simply be proprietary API’s, unique to the vendor.

So we need to take a look at how API’s can support interoperability and also the problems that occur in API implementation that make interoperability more challenging.

API Interoperability

There are a number of levels at which API’s are open and potentially interoperable, or not.

  • Availability of the specification and support by the vendor of third-party implementation (standard or proprietary)
  • Level of compliance with any documentation (standardised or not)
  • Ability of the underlying components to satisfy the exposed API

Previously, we covered the different degrees of compliance and the obstacles that these put in the way of successful Open Network solutions. In this post we’ll elaborate on the other two only.

Availability of the Interface Specification

Open Standards specifications are generally available, but often not freely available. Some organisations restrict specifications to varying levels of membership of their organisation. Sometimes only paid members can access the specifications.

Proprietary interfaces may be available under certain limited conditions or they may not be available at all. Availability is usually higher for de facto standards, because it enables the standards owner to exert some influence over the marketplace. Highly proprietary interfaces often have higher hurdles to obtain access, typically only if an actual customer requests the specification for itself or on behalf of a solution integrator.

Practical Accessibility in a Project

It’s one thing to get access to an API specification document, but it’s very much another to gain practical access to the information necessary to implement an interface to that API.

An Open Network solution may have hundreds of API’s in its inventory of components, or more. These API’s must be available for use by the solution designers. A typical solution is to publish these API’s in a searchable catalog. This might be ‘open’ in one sense, but not necessarily Interoperable.

Solution integrators must also have access to a support resource to help with issues arising from the implementation (bugs, etc). It is far too common for the API document to be of limited detail, inaccurate, and even out-of-date. The richness of this support resource and the availability of live support specialists will directly translate to implementation productivity.

Ability of the Underlying Components to Satisfy the API

Software has a number of successes at implementing syntactic and representational openness but not semantic openness. Using the REST standard as an example, I can post a correctly formatted and encoded payload to a REST endpoint, but unless the receiving application understands the semantic content then the interface doesn’t work.

And if the underlying components cannot service the request in a common (let alone standard) way, theoretical interoperability becomes difficult and/or constrained.

An NFV example may help.

Consider an NFV Orchestration use case that performs auto-scaling of NFV instances based on some measure of throughput against capacity. Most NFV components make it easy to obtain the required measures of the relevant metric via telemetry.

But it is the range of available metrics and the algorithms used to generate the metrics that introduces complexity and potentially impacts Interoperability.

One NFV vendor might provide this measure in terms of CPU utilisation at a total NFV level. Another might provide the CPU utilisation at a VM level. Or vendors may use different algorithms for calculating the metric that they call “CPU Utilisation” or may vary considerably in the timing of updates. Another vendor might not provide CPU utilisation at all but may provide a metric of packets per second.
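
To make the gap concrete, here is a small Python sketch of the kind of adapter an integrator ends up writing. The vendor names and field names are entirely invented for illustration; the point is that each payload is perfectly valid JSON over REST, yet the orchestrator still needs a per-vendor semantic mapping before it can make a scaling decision.

# Hypothetical sketch of the semantic gap described above: two vendors expose
# syntactically valid telemetry, but one reports whole-VNF "cpu_util", another
# per-VM "cpuUtilisation", and a third only a packet rate. All field names here
# are invented; a real integration would map each vendor's documented metric
# into one normalised scaling signal.

def normalise_load(vendor, payload):
    """Return a 0.0-1.0 load figure the orchestrator can scale on."""
    if vendor == "vendor_a":
        # Whole-VNF CPU utilisation, already a percentage.
        return payload["cpu_util"] / 100.0
    if vendor == "vendor_b":
        # Per-VM figures: take the busiest VM as the scaling trigger.
        return max(vm["cpuUtilisation"] for vm in payload["vms"]) / 100.0
    if vendor == "vendor_c":
        # No CPU metric at all: approximate load from packet rate
        # against a capacity figure taken from the vendor data sheet.
        return payload["packets_per_second"] / payload["rated_pps"]
    raise ValueError("no semantic mapping defined for %s" % vendor)


print(normalise_load("vendor_a", {"cpu_util": 72}))
print(normalise_load("vendor_b", {"vms": [{"cpuUtilisation": 40},
                                          {"cpuUtilisation": 85}]}))
print(normalise_load("vendor_c", {"packets_per_second": 300000,
                                  "rated_pps": 1000000}))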

Conclusion

API’s play a significant role in the implementation of Open Network solutions and the achievement of interoperability. However, they are not a “silver bullet” and there can be many challenges. As with standards compliance, API availability, and any claimed compliance with a standard, cannot be assumed.

In the last few posts we’ve focused on software-related topics, but it’s time to bring back the Networking side of Open Networking for our last two posts. Leaving technology aside for the moment, how does a solution integrator deal with the different paradigms for solution implementation that can exist in an Open Networking project? We’ll cover that in the next post.

Stay tuned.

Ready to move your network into the software defined future?
Automate your network with ONAP.

Find Out How

The post Real-World Open Networking. Part 4 – Interoperability: Problems with API’s appeared first on Aptira.

by Adam Russell at October 04, 2019 04:41 AM

October 03, 2019

RDO

RDO is ready to ride the wave of CentOS Stream

The announcement and availability of CentOS Stream has the potential to improve RDO’s feedback loop to Red Hat Enterprise Linux (RHEL) development and smooth out transitions between minor and major releases. Let’s take a look at where RDO interacts with the CentOS Project and how this may improve our work and releases.

RDO and the CentOS Project

Because of tight coupling with the operating system, RDO project joined the CentOS SIGs initiative from the beginning. CentOS SIGs are smaller groups within the CentOS Project community focusing on a specific area or software type. RDO was a founding member of the CentOS Cloud SIG that is focusing on cloud infrastructure software stacks and is using the CentOS Community BuildSystem (CBS) to build final releases.

In addition to the Cloud SIG OpenStack repositories, during release development the RDO Trunk repositories provide packages for new commits in OpenStack projects soon after they are merged upstream. After commits are merged, a new package is created and a YUM repository is published on the RDO Trunk server, including this new package build and the latest builds for the rest of the packages in the same release. This enables packagers to identify packaging issues almost immediately after they are introduced, shortening the feedback loop to the upstream projects.

How CentOS Stream can help

A stable base operating system, on which continuously changing upstream code is built and tested, is a prerequisite. While CentOS Linux did come close to this ideal, there were still occasional changes in the base OS that were breaking OpenStack CI, especially after a minor CentOS Linux release where it was not possible to catch those changes before they were published.

The availability of rolling-release CentOS Stream, announced alongside CentOS Linux 8,  will help enable our developers to provide earlier feedback to the CentOS and RHEL development cycles before breaking changes are published. When breaking changes are necessary, it will help us adjust for them ahead of time.

A major release like CentOS Linux 8 is even more of a challenge. RDO managed the transition from EL6 to EL7 during the OpenStack Icehouse cycle by building two distributions in parallel, but that was five years ago, with a much smaller package set than we have now.

For the current OpenStack Train release in development, the RDO project started preparing for the Python 3 transition using Fedora 28, which helped get this huge migration effort going. At the same time, Fedora 28 was only a rough approximation of RHEL 8/CentOS Linux 8, so complete re-testing on RHEL was required.

Since CentOS Linux 8 was released very close to the OpenStack Train release, the RDO project will initially provide RDO Train only on the EL7 platform and will add CentOS Linux 8 support soon after.

For future releases, the RDO project is looking forward to being able to start testing and developing against CentOS Stream updates as they are developed, to provide feedback, and to help stabilize the base OS platform for everyone!

About The RDO Project

The RDO project is providing a freely-available, community-supported distribution of OpenStack that runs on Red Hat Enterprise Linux (RHEL) and its derivatives, such as CentOS Linux. RDO also makes the latest OpenStack code available for continuous testing while the release is under development.

In addition to providing a set of software packages, RDO is also a community of users of cloud computing platforms on Red Hat-based operating systems where you can go to get help and compare notes on running OpenStack.

by apevec at October 03, 2019 08:26 PM

Aptira

Real-world Open Networking. Part 3 – Interoperability: Problems with Standards

In our last post we unpacked Interoperability, including Open Standards. Continuing this theme, we will look at how solution developers implement standards compliance and the problems that arise.

Introduction

Mandating that vendors (and internal systems) comply with Open Standards is a strategy used by organisations to drive interoperability. The assumption is that Open Standards compliant components will be interoperable.

In this post we examine the many reasons why that assumption does not always hold in real-world situations. This analysis is from the software perspective, since network equipment generally does a better job of component interoperability than software. We will cover general aspects of standards compliance in this post, and the specific aspects of API’s in the next post.

Software Implementation & Interoperability based on Standards

Whether the standard is “de jure” or “de facto”, there are three basic approaches to implementing software compliance with the standards:

  • Reference implementation compatible
  • Reference document compatible
  • Architecture pattern or guideline compatible

Reference Implementation Compatible

This approach consists of two parts:

  • The standard is a controlling design input: i.e. compliance overrides other design inputs; and
  • Validation against a “reference implementation” of the standard.

A “reference implementation” is a software component that is warranted to comply with the standard and is a known reference against which to validate a developed component. This should also include a set of standard test cases that verify compliance and/or highlight issues.

Vendors often provide the test results as evidence and characterisation of the level of compliance. 
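
As a rough illustration of what reference-implementation validation looks like in practice, the Python sketch below runs a set of standard test cases against both a reference implementation and the candidate component and compares the results. The endpoints and test cases are hypothetical placeholders; a real compliance suite would be published by the standards body and would compare responses against the standard’s expected schemas rather than just against each other.

# Minimal sketch of reference-implementation validation, under assumed names:
# REFERENCE_URL points at the certified reference implementation, CANDIDATE_URL
# at the component being validated, and STANDARD_CASES stands in for the
# standards body's published test cases. None of these are real endpoints.
import requests

REFERENCE_URL = "http://reference.example.test"
CANDIDATE_URL = "http://candidate.example.test"

STANDARD_CASES = [
    {"name": "TC-001 basic query", "path": "/resources?type=port"},
    {"name": "TC-002 unknown type", "path": "/resources?type=bogus"},
]


def run_compliance_suite():
    results = []
    for case in STANDARD_CASES:
        ref = requests.get(REFERENCE_URL + case["path"])
        cand = requests.get(CANDIDATE_URL + case["path"])
        # Compliance here means the same status code and response body;
        # a real suite would check against the standard's expected schema.
        passed = (ref.status_code == cand.status_code
                  and ref.json() == cand.json())
        results.append((case["name"], passed))
    return results


if __name__ == "__main__":
    for name, passed in run_compliance_suite():
        print("%s: %s" % (name, "PASS" if passed else "FAIL"))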

Benefits of this approach

This is the highest level of compliance possible against a standard. Two components that have been validated against the standard will be interoperable at the lowest common level to which they have both passed the test.

Problems with this approach

A reference implementation must exist and be available; however, this is not always the case. The reference implementation must also be independently developed and certified, often by the standards body itself.

Reference Document Compatible

This approach is similar to the Reference Implementation approach. Firstly, the documented standard is a controlling design input. However the second part (validation against the standard) is both optional and highly variable. At the most basic level, compliance could be just that the vendor asserts component compliance with the standard. Alternatively, compliance may be validated by comparison between the developed component and the documentation, and there are many ways to do this at varying levels of accuracy and cost.

Benefits of this approach

The main benefit of this approach is that the design is driven by the standard, and at this level it is equivalent to the Reference Implementation approach.

Problems with this approach

Validation without a reference implementation is highly manual and potentially subject to interpretation. This type of validation is very expensive, which creates cost pressure for vendors to only partially validate, especially for repeat version upgrades and enhancements.

Architecture Pattern Compatible

In this case the standard is used as one input to the design, but not as the controlling input. The intent is not compliance but alignment. The product may use the same or similar underlying technologies as defined in the standards (e.g. REST interfaces, or the same underlying data representation standards such as XML or JSON). The vendor may adopt a component architecture similar to the standard (e.g. microservices).

Benefits of this approach

At best, this approach may provide a foundation for future compliance.

Problems with this approach

In general, the vendor is designing their product to be “not incompatible” with the standard, without taking on the cost of full compliance.

Rationale for Vendors to Implement Standards

Standards compliance is expensive to implement, regardless of the approach taken. So each vendor will take its own approach, based on its own situation and context. A vendor may:

  • Completely ignore the standards issue
  • Deliberately, e.g. a start-up, whose early target customers don’t care.
  • Accidentally, if they are unaware of the standards.
  • Not see a competitive advantage in their marketplace: not so much as to justify the cost of standards implementation.
  • Adopt a customisation approach: in other words, implement standardisation when required.
  • Have full compliance in their roadmap for future implementation and simply want a foundation to build on.

Problems with compliance

There are a wide range of implementations and the results are highly variable. The important thing to remember is that a claim of “standards compliance” can mean many things.

From a starting point of the intent to comply (or at least claim compliance), and using any of the strategies above, a vendor can be non-compliant in many ways:

  • Partial implementation, e.g. a custom solution for one customer that is “productised”;
  • Defects in implementation, including misinterpretation of the standard;
  • Deliberate forking of the standard, including the implementation of superset functionality (“our solution is better than the standard”);
  • The incompatibility of underlying or related components;
  • Compliance with limited subsets of the standard, e.g. the most often used functions;
  • Some vendors may misrepresent compliance based on tenuous connections: e.g. a vendor might claim compatibility on the basis that their API’s are REST-based and nothing more.

Conclusion

Nothing can be assumed about standards compliance, other than that each vendor’s claims must be validated. The other part of this issue is Application Program Interfaces (API) interoperability. We will cover this in the next post. Stay tuned.

Become more agile.
Get a tailored solution built just for you.

Find Out More

The post Real-world Open Networking. Part 3 – Interoperability: Problems with Standards appeared first on Aptira.

by Adam Russell at October 03, 2019 03:55 AM

October 02, 2019

OpenStack Superuser

Meet the Shanghai Open Infrastructure Superuser Award nominees

Who do you think should win the Superuser Award for the Open Infrastructure Summit Shanghai?

When evaluating the nominees for the Superuser Award, take into account the unique nature of use case(s), as well as integrations and applications of open infrastructure by each particular team. Rate the nominees before October 8 at 11:59 p.m. Pacific Daylight Time.

Check out highlights from the five nominees and click on the links for the full applications:

  • Baidu ABC Cloud Group and Edge Security Team, who integrated Kata Containers into the fundamental platform for the entire Baidu internal and external cloud services, and who built a secured environment upon Kata Containers for the cloud edge scenario, respectively. Their cloud products (including VMs and bare metal servers) cover 11 regions including North and South China, 18 zones and 15 clusters (with over 5000 physical machines per cluster).
  • FortNebula Cloud, a one-man cloud show and true passion project run by Donny Davis, whose primary purpose is to give back something useful to the community, and secondary purpose is to learn how rapid fire workloads can be optimized on OpenStack. FortNebula has been contributing OpenDev CI resources since mid 2019, and currently provides 100 test VM instances which are used to test OpenStack, Zuul, Airship, StarlingX and much more. The current infrastructure sits in a single rack with one controller, two Swift, one Cinder and 9 compute nodes; total cores are 512 and total memory is just north of 1TB.
  • InCloud OpenStack Team, of Inspur, who has used OpenStack to build a mixed cloud environment that currently provides service to over 100,000 users, including over 80 government units in mainland China. Currently, the government cloud has provided 60,000+ virtual machines, 400,000+ vCPUs and 30PB+ of storage for users, and hosts 11,000+ online applications.
  • Information Management Department of Wuxi Metro, whose Phase II of the Wuxi Metro Cloud Platform project involved the evolution from IaaS to PaaS on their OpenStack-based private cloud platform. In order to acquire IT resources on demand and improve overall business efficiency, Wuxi Metro adopted the Huayun Rail Traffic Cloud Solution, which features high reliability, high efficiency, ease of management and low cost.
  • Rakuten Mobile Network Organization, of Rakuten Inc., Japan, launched a new initiative to enter the mobile market space in Japan last year as the 4th Mobile Network Operator (MNO), with a cloud-based architecture based on OpenStack and Kubernetes. They selected to run their entire cloud infrastructure on commercial, off-the-shelf (COTS) x86 servers, powered by Cisco Virtualized Infrastructure Manager (CVIM), an OpenStack-based NFV platform. The overall plan is to deploy several thousand clouds running vRAN workloads spread across all of Japan to a target 5M mobile phone users. Their current deployment includes 135K cores, with a target of one million cores when complete.

Each community member can rate the nominees once by October 8 at 11:59 p.m. Pacific Daylight Time.

Previous winners include AT&T, City Network, CERN, China Mobile, Comcast, NTT Group, the Tencent TStack Team, and VEXXHOST.

The post Meet the Shanghai Open Infrastructure Superuser Award nominees appeared first on Superuser.

by Superuser at October 02, 2019 06:11 AM

Aptira

Real-world Open Networking. Part 2 – Interoperability: The Holy Grail

In our last post we described the attributes of an Open Network Solution. In this post we unpack what surely is the “Holy Grail” of Open Networks: Interoperability. Interoperability is defined as:

Interoperability is a characteristic of a product or system, whose interfaces are completely understood, to work with other products or systems, at present or in the future, in either implementation or access, without any restrictions

Wikipedia: https://en.wikipedia.org/wiki/Interoperability

Bottom line, interoperability means that I can freely substitute components, i.e. If I have Component X in my Open Networking solution, then in the future I can freely replace it with Component Y.

Today, components vary widely in that ability, across any of the component “form factors” we described in this post. But it’s not just technology components that play into our concept of interoperability.

Interoperability aspects of Open Systems

By definition, to be highly interoperable, we need to be able to freely substitute the components along each of the dimensions of openness that we described here:

  • Open Standards
  • Open API’s
  • Open Partners
  • Open Source
  • Open Operations

Let’s look briefly at each of these in turn.

Open Standards

Probably the most obvious and most-used strategy for driving interoperability has been the definition and/or adoption of standards. This covers both those standards formally established by standards bodies (“de jure” standards) and those established informally by market dominance or pervasive use (“de facto” standards).

The idea of Open Standards does not really mean substituting different standards for each other (although solution designers should probably consider standard-switching costs in their design phase). Using Open Standards means selecting one standard for a particular functional area of the solution and driving the level of compliance to this standard so as to allow the free switching of components within that functional area of the solution.

This is a complex subject which we are going to review in the next post.

Open API’s

From a software integration perspective, Application Programming Interfaces (API’s) are a fundamental way in which interoperability can be achieved. There are a number of perspectives to the “Openness” of an API:

  • Ease of accessibility to the specification
  • Access to a test environment or reference implementation
  • Use of common and open interface protocols e.g. REST

API’s and the issues of interoperability is also quite a complex topic, and we’ll explore this in more detail in the next post plus one.

Open Partners

Being able to switch partners freely is key to open systems, but there are several factors to consider:

  • Contractual relationships: It makes sense to set up a firm relationship as a baseline but also to be able to modify or terminate the contract if circumstances require. Also relevant is the ability to partner in novel and creative ways, for example joint ventures and reseller arrangements.
  • Technology transparency: A vendor does not use or rely on proprietary components that cannot be transferred to other partners or used in-house by the solution owner.
  • Win-Win relationships: Open Partners depends on establishing and promoting a win-win relationship between supplier and customer, rather than one benefiting at the expense of the other.

Unfortunately, there have been many instances of vendors attempting (and succeeding) in locking themselves into customers long-term. This form of “rent seeking” generates ill-will and gives rise to many protective measures, some of them extreme and working against the idea of “Win-Win”.

Open Source

We discussed Open Source previously, but we didn’t examine it from an interoperability perspective. Open source addresses the technology transparency requirement of Open Partners: it is accessible to all and is therefore transparent. Although effort may be required to ramp up knowledge of a new vendor, Open Source components often have multiple sources of support, knowledge and resources that can assist in this process.

Open Operations

As mentioned in the last post, Open Operations means that operational processes are flexible and open to change, open to engagement, and transparent. In short, this summarises DevOps.

We reviewed the DevOps paradigm in an earlier article. Practices are open when new vendors can slot in without friction or significant overhead. There is a clean flow-through from business to development to operations that enables new vendors to rapidly pick up the pace of value-add. The use of common concepts, tools, and practice enables new vendors to understand where they fit in very quickly.

Conclusion

We can see from the above that all the attributes of Open solutions drive interoperability. Most business cases and project plans contain risk assessments, and sometimes financial provisions, for dealing with the costs of change that may occur during a project: a failure of a partner, a piece of technology, or some other aspect of the solution will incur costs and delays as new components are selected, validated and integrated into the solution.

The only real test of interoperability is that these risk events, if they occur, are orders of magnitude less difficult and costly, such that the risk provisions can be reduced or eliminated. We know what to aim for, but alas we are far from that stage. In this post we have covered some of the issues that cause this failed assumption. We’ll cover more in our next post.

Stay tuned.

Become more agile.
Get a tailored solution built just for you.

Find Out More

The post Real-world Open Networking. Part 2 – Interoperability: The Holy Grail appeared first on Aptira.

by Adam Russell at October 02, 2019 05:36 AM

Trinh Nguyen

"Searchlight for U" at the Korea&Vietnam OpenInfra User Group meetup

Last night, in a cozy conference room in Seoul, South Korea, I had a very friendly meetup with the OpenStack Korea User Group, with around ten or so people. Sa Pham and I, the Vietnam OpenInfra User Group representatives, were there to share our experiences with OpenStack and to network with others. This was not my first time with the Korea User Group, but meeting people who work on open source projects or who want to learn about OpenInfra technologies made me super excited.

Like last time, I had a brief presentation about OpenStack Searchlight showing folks what was going on and my plan for the Ussuri development cycle. And, that is why the title of my talk is "Searchlight for U".


Even though I had not put much effort into Searchlight during Train, while presenting the progress I was amazed at how far we have gone. I have been Searchlight’s PTL for two cycles and am now serving one more. Hopefully, I can move the project forward with some real-world adoption, use cases, and especially by attracting more contributors.


We only had three presentations in total, and it went quickly because we wanted to spend more time on networking and getting to know each other. In the end, the Korea User Group organizers and I discussed our plan for the next OpenInfra study sessions. We then said goodbye and promised to hold this kind of event more frequently.

I really had a great time yesterday.

by Trinh Nguyen (noreply@blogger.com) at October 02, 2019 02:08 AM

October 01, 2019

Mirantis

Democratizing Connectivity with a Containerized Network Function Running on a K8s-Based Edge Platform — Q&A

The Facebook-initiated Magma project makes it possible to run Containerized Network Functions in an Edge Cloud environment, thus opening up a whole range of capabilities for providers that may otherwise be limited in what they can provide.

by Nick Chase at October 01, 2019 09:50 PM

OpenStack Superuser

Shanghai Superuser Award Nominee – FortNebula

It’s time for the community to help determine the winner of the Open Infrastructure Summit Shanghai Superuser Awards. The Superuser Editorial Advisory Board will review the nominees and determine the finalists and overall winner after the community has had a chance to review and rate nominees.

Now, it’s your turn.

FortNebula Cloud is one of five nominees for the Superuser Awards. Review the nomination criteria below, check out the other nominees and rate the nominees before the deadline October 8 at 11:59 p.m. Pacific Daylight Time.

Rate them here!

Who is the nominee?

FortNebula Cloud – Donny Davis

FortNebula Cloud was nominated by a community member, so we reached out to Davis to provide some extra context. See both the nomination, and Davis’ responses below.

How has open infrastructure transformed the organization’s business?

The FortNebula cloud is Davis’ mad scientist garage cloud. As such I’m not sure we can speak to how it has transformed culture or business, but we can be amazed at how one person is able to do so much with few resources. Davis does this largely on his own time at home, and is able to provide a good chunk of OpenDev’s CI resources.

Davis:

I don’t have a business or make any money off this cloud. This is completely privately funded. This project’s primary purpose is to give something useful back to the community, and its secondary purpose is to learn how rapid fire workloads can be optimized on OpenStack. I am a one man show that does this purely in my off-time… because building clouds that do real things is fun.

How has the organization participated in or contributed to an open source project?

FortNebula has been contributing OpenDev CI resources since about the end of July 2019. We currently get 100 test VM instances from FortNebula cloud which are used to test OpenStack, Zuul, Airship, StarlingX and much more.

What open source technologies does the organization use in its open infrastructure environment?

FortNebula cloud runs OpenStack deployed with TripleO. Other technologies that are used include gnocchi and grafana.

Davis:

Fortnebula uses Open Source technologies for every single component if possible. The current inventory of software is ansible, puppet, Openstack, CentOS, Ubuntu, freebsd, and pfSense.

What is the scale of your open infrastructure environment?

There are currently 9 OpenStack compute nodes. Unfortunately, I do not run the cloud, so I do not have detailed numbers for things like cores/memory. The information I do have can be found at https://grafana.fortnebula.com/d/9MMqh8HWk/openstack-utilization

Davis:

The current infrastructure sits in a single rack with one controller, two swift, one cinder and 9 compute nodes. Total cores are 512 and total memory is just north of 1TB.

What kind of operational challenges have you overcome during your experience with open infrastructure?

As part of the onboarding with OpenDev the FortNebula cloud has had to be refactored a couple times to better meet the demands of a CI environment. Nodepool may request many instances all at once and that has to be handled. Test node disk IO throughput was too slow in the initial build out which led to replacing storage with faster devices and centralizing instance root disk hosting.

Davis:

Well the first challenge to meet was IO, as my ceph storage on spinning disks did not perform well for the workload. I moved all the compute nodes to local storage built on NVME. The second was network performance. My network is unique in that I use BGP to each OpenStack tenant, and my edge router advertises a whole subnet for each tenant, which then uses a tunnel broker to provide direct IPv6 connectivity. Through working directly with the infra community we were able to use OpenStack itself to optimize the traffic flows so the infrastructure could keep up with the workloads.

How is this team innovating with open infrastructure?

The FortNebula cloud is showing that you can build an effective OpenStack cloud with a small number of institutional resources as well as human resources. In addition to that, FortNebula is a predominantly IPv6 first cloud. We are able to give every test instance a public IP address by embracing IPv6.

Davis:

FortNebula is a demonstration of what one person and one rack of old equipment can do. If some guy in his basement can build a CI grade cloud, just imagine what a team of dedicated people and real funding could do for a business. It’s also an example to show that open infra is not that hard, even for someone to do in their off time.

Each community member can rate the nominees once by October 8 at 11:59 p.m. Pacific Daylight Time.

The post Shanghai Superuser Award Nominee – FortNebula appeared first on Superuser.

by Superuser at October 01, 2019 08:13 PM

Shanghai Superuser Award Nominee: Baidu ABC Cloud Group & Security Edge teams

It’s time for the community to help determine the winner of the Open Infrastructure Summit Shanghai Superuser Awards. The Superuser Editorial Advisory Board will review the nominees and determine the finalists and overall winner after the community has had a chance to review and rate nominees.

Now, it’s your turn. 

The Baidu ABC Cloud Group and Security Edge team is one of five nominees for the Superuser Awards. Review the nomination criteria below, check out the other nominees and rate the nominees before the deadline October 8 at 11:59 p.m. Pacific Daylight Time.

 Rate them here!

Who is the nominee?

Baidu (Nasdaq: BIDU), the dominant Chinese search engine operator, the largest Chinese website in the world, and a leading global AI company, has over 800,000 clients, more than 30,000 employees, and nearly 15,000 patents. In 2018, the company reported annual revenue of $14 billion.

Application units: the Baidu ABC (AI, Big Data, and Cloud Computing) Cloud Group, which integrated Kata Containers into the fundamental platform for all of Baidu's internal and external cloud services, and the Baidu Security Edge team, which built a secured environment on top of Kata Containers for cloud edge scenarios.

Members: Xie Guangjun, Zhang Yu, He Fangshi, Wang Hui, Shen Jiale, Ni Xun, Hang Ding, Bai Yu, Zhou Yueqian, Wu Qiucai

How has open infrastructure transformed the organization’s business?

In 2019, our Kata Containers-based products are enjoying market success in the areas of FaaS (Function as a Service), CaaS (Container as a Service) and edge computing. Baidu's Cloud Function Computing service (CFC), based on Kata Containers, provides computing power for nearly 20,000 skills from over 3,000 developers for Baidu DuerOS (a conversational AI operating system with a “100 million-scale” installation base). The Baidu Container Instance service (BCI) provides a multitenant-oriented serverless data processing platform for the internal big data business of Baidu's big data division. The Baidu Edge Computing (BEC) node is open to all clients while keeping them separated from each other for security and ensuring high performance.

How has the organization participated in or contributed to an open source project?

Baidu is very actively involved in collaboration across open source communities. It is a Gold Member of the CNCF, a Premier Member of the LF AI Foundation, the Hyperledger Foundation and the LF Edge Foundation, and a Silver Member of the Apache Software Foundation.

Baidu maintains more than 100 open source projects on GitHub, including Apollo, an open source autonomous driving platform, and PaddlePaddle, a deep learning framework, among others.

For Kata Containers, Baidu has made more than 16 functional patch modifications, of which 8 patch sets were contributed to the community. We also published a white paper on Kata Containers production practice as a contribution to the community.

What open source technologies does the organization use in its open infrastructure environment?

Baidu has used Kata Containers to provide high performance and protect data security and the confidentiality of algorithms in different cloud computing scenarios. By using the OpenStack control plane, we seamlessly integrated Kata Container instances with Baidu cloud storage and network in BCI products.

Other technologies used in the Kata Containers-based products include QEMU-KVM, OpenStack (Nova, Cinder, Glance and Neutron), Open vSwitch and Kubernetes. The open source device mapper (in the Linux kernel) and the qcow2 format are used for storage performance optimization, and DPDK is applied for network performance optimization.

What is the scale of your open infrastructure environment?

Baidu has more than 500,000 machines deployed with a Linux kernel based on the community version.

Baidu Cloud products (including virtual machines and bare metal servers) cover 11 regions, including North and South China, with 18 zones and 15 clusters (distributed across the regions and zones), amounting to tens of thousands of physical machines; a single container cluster includes more than 5,000 physical machines.

What kind of operational challenges have you overcome during your experience with open infrastructure?

  • Support for mounting user code dynamically in containers – Since Kata Containers' host and guest do not share a kernel, static mounting before startup is certainly possible, but mounting dynamically is a challenge.
  • Cold start performance optimization of cloud functions – Through optimization, creation and startup performance can match that of runc.
  • Function density optimization – The higher the function density, the more services a physical machine can provide, thus lowering the cost.
  • Hardware extensibility – Baidu's cloud also provides products and services in big data and AI. Hardware such as GPUs for AI requires passthrough into the container. By using Kata Containers, Baidu achieved hardware extensibility in its AI services.

How is this team innovating with open infrastructure?

  • The Baidu Edge Computing (BEC) product requires virtual machines with very lightweight creation and release while preserving a similar device model. The product is based on Kata Containers, and the bottom layer uses a QEMU optimized by Baidu.
  • By integrating with modules such as Nova, Glance and Neutron in OpenStack, Baidu implemented co-location of container instance nodes and virtual machines.
  • Based on virtio-blk/virtio-scsi, Baidu optimized the VM file system so that its performance is close to that of the host (single queue, single thread).
  • Baidu implemented a network scheme compatible with Neutron and Open vSwitch for network maintenance, including network isolation and rate limiting, while reusing the previous network architecture.

Each community member can rate the nominees once by October 8 at 11:59 p.m. Pacific Daylight Time.

The post Shanghai Superuser Award Nominee: Baidu ABC Cloud Group & Security Edge teams appeared first on Superuser.

by Superuser at October 01, 2019 08:13 PM

Shanghai Superuser Award Nominee: Information Management Department of Wuxi Metro

It’s time for the community to help determine the winner of the Open Infrastructure Summit Shanghai Superuser Awards. The Superuser Editorial Advisory Board will review the nominees and determine the finalists and overall winner after the community has had a chance to review and rate nominees.

Now, it’s your turn.

The Information Management Department of Wuxi Metro is one of five nominees for the Superuser Awards. Review the nomination criteria below, check out the other nominees and rate the nominees before the deadline October 8 at 11:59 p.m. Pacific Daylight Time.

Rate them here!

Who is the nominee?

Information Management Department of Wuxi Metro 

How has open infrastructure transformed the organization’s business?

In 2019, Phase II of the Wuxi Metro Cloud Platform was accepted. A private cloud platform based on OpenStack was used in both phases. The Phase II project involved evolving the cloud platform from IaaS to PaaS and seamlessly integrating it with the business to ensure safe operation and sustainable business development for Wuxi Metro.

Upon completion of Phase II of the cloud platform, Wuxi Metro can focus on its own business, with resources, network and services made transparent and utilized on demand. With the functions provided at the business function layer and the various services provided at the service layer, business needs such as building CI/CD processes, big data development and business application development can be met quickly.

How has the organization participated in or contributed to an open source project?

In order to acquire IT resources on demand and improve overall business efficiency, Wuxi Metro adopted the Huayun Rail Traffic Cloud Solution, a cloud computing service customized for customers in the rail transit industry. The solution features high reliability, high efficiency, ease of management and low cost, and helps rail transit customers transform from traditional IT to cloud computing, improving information system services in many aspects and supporting business development.

What open source technologies does the organization use in its open infrastructure environment?

Cloud resource management: with support for multiple hypervisors, the cloud resource management layer can manage and schedule virtual resources based on KVM, VMware, Xen and Hyper-V.

Heterogeneous resource management: the basic resource platform can bring existing physical servers and heterogeneous storage under unified management. Physical resources can be used as independent resources or integrated into the virtualized resource pool. Storage devices that support the OpenStack standard interface are managed centrally as a storage service.

What is the scale of your open infrastructure environment?

The project integrated nearly 100 servers into a unified computing resource pool providing nearly 1,500 virtual machines. It also integrated multiple sets of storage resources into a unified storage resource pool providing nearly 180 TB of capacity.

What kind of operational challenges have you overcome during your experience with open infrastructure? 

  • Security risk due to the lack of disaster recovery
  • Privilege confusion due to too many people being involved
  • Unavoidable single points of failure
  • Scattered monitoring without a global view
  • Scattered devices that are difficult to manage
  • Performance degradation caused by data explosion
  • High cost and low utilization

How is this team innovating with open infrastructure?

  • Simplified management – The private cloud platform relies on virtualization technology to integrate server resources effectively, and it provides purpose-built functions, so data center management becomes much easier.
  • Effective utilization of resources – Evaluate the existing device inventory to reuse older devices, migrate all appropriate business systems to the cloud, benefit from the efficient resource integration brought by virtualization and cloud computing, utilize existing resources effectively, and reduce the number of devices.
  • High availability ensuring there is no service interruption
  • Fast business go-live
  • Efficient data protection
  • Business resource resilience
  • Improved system security

Each community member can rate the nominees once by October 8 at 11:59 p.m. Pacific Daylight Time.

The post Shanghai Superuser Award Nominee: Information Management Department of Wuxi Metro appeared first on Superuser.

by Superuser at October 01, 2019 08:13 PM

Shanghai Superuser Award Nominee: InCloud OpenStack Team

It’s time for the community to help determine the winner of the Open Infrastructure Summit Shanghai Superuser Awards. The Superuser Editorial Advisory Board will review the nominees and determine the finalists and overall winner after the community has had a chance to review and rate nominees.

Now, it’s your turn. 

The InCloud OpenStack Team is one of five nominees for the Superuser Awards. Review the nomination below, check out the other nominees and rate the nominees before the deadline October 8 at 11:59 p.m. Pacific Daylight Time.

 Rate them here!

Who is the nominee? 

The InCloud OpenStack Team comprises over 100 members who developed an OpenStack-based private and hybrid cloud platform, a smart cloud operating system designed for the next generation of cloud data centers and cloud-native applications.

It consists of four sub-teams:

  • Product design: requirement analysis and interaction design
  • Product architecture: solution design and technology research
  • Product development: feature design and implementation
  • Operations support: deployment, troubleshooting.

As a Gold Member of the OpenStack Foundation, Inspur is actively involved in the OpenStack community. Team members include Kaiyuan Qi, Zhiyuan Su, Brin Zhang and Guangfeng Su.

How has open infrastructure transformed the organization’s business?

InCloud OpenStack is committed to becoming a new type of cloud service provider for the government, and it has ranked first in the government cloud market for five consecutive years. At present, we provide government cloud services to more than 80 government units in mainland China. We use OpenStack to build a mixed cloud environment for our customers that supports 100% of both traditional and cloud-native applications. The government cloud based on OpenStack reduces the time to bring a customer's application system online from 6 months to less than 1 week, saves customers 45% in server investment and cuts operation and maintenance costs by 55%. Currently, InCloud OpenStack provides cloud services to more than 100,000 cloud users.

How has the organization participated in or contributed to an open source project?

As a Gold Member of the OpenStack Foundation, Inspur is actively involved in the OpenStack community and is committed to being a top practitioner of OpenStack, supporting successful deployments in various industries and sharing optimization and large-scale deployment experience at the Austin and Denver Summits, OpenStack China Days, OpenStack China hackathon activities and meetup technology exchanges. Inspur is also a member of the CNCF and the Linux Foundation, and a core and founding member of ODCC, OCP and Open19. The team has contributed over 58 commits and reported over 41 bugs or issues in OpenStack projects.

What open source technologies does the organization use in its open infrastructure environment?

All of Inspur Cloud's development and CI/CD tools are built using open source technologies, including but not limited to: Chef, Ansible, Terraform, OpenStack, ELK, Kafka, Docker, Kubernetes, Jenkins, Go, Keepalived, etcd, Grafana, InfluxDB, Kibana, Git and OVS.

What is the scale of your open infrastructure environment?

Inspur's public cloud and government cloud platforms adopt a variety of technical architectures and operate at a large overall scale, and they are progressively migrating to OpenStack. At present, the clusters running OpenStack total 5,000+ nodes, a figure that will grow rapidly in the future. The government cloud currently provides 60,000+ virtual machines, 400,000+ vCPUs and 30+ PB of storage for users, and hosts 11,000+ online applications. Inspur Cloud is building opscenter tools based on Kubernetes and unified region management; at present there are more than 5,000 Kubernetes pods (expected to reach 30,000+ after the LCM migration project is launched). The Inspur Cloud DevOps platform provides a CI/CD environment for more than 10,000 developers.

What kind of operational challenges have you overcome during your experience with open infrastructure?

OpenStack's components depend on a message queue (MQ), and when the cluster is large, the MQ becomes the bottleneck to scaling out. We use Nova cells v2 with a dedicated MQ cluster to solve this problem. When a virtual machine has a large amount of memory and generates dirty data faster than it can be transferred, live migration fails; we use the post-copy and coverage features to solve this problem.
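As a rough sketch of the kind of configuration this involves (the cell name, connection URLs and option values below are illustrative assumptions, not Inspur's actual settings), a dedicated message queue per cell and post-copy live migration might be set up along these lines:

# Map an additional Nova cell to its own dedicated RabbitMQ cluster and
# database, so a single MQ is no longer the scaling bottleneck
# (URLs are placeholders).
nova-manage cell_v2 create_cell --name cell1 \
    --transport-url rabbit://nova:secret@mq-cell1:5672/ \
    --database_connection mysql+pymysql://nova:secret@db-cell1/nova_cell1

# nova.conf on compute nodes: allow live migration to switch to post-copy
# when guests dirty memory faster than it can be transferred.
# [libvirt]
# live_migration_permit_post_copy = True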

Hardware heterogeneity and flexible expansion are pain points of cloud computing networks. The Inspur cloud network solves the problem of flexibly expanding virtual network functions through a self-developed EIP cluster and secondary development of Neutron, and it shields the heterogeneity of the underlying devices through that self-developed cluster.

How is this team innovating with open infrastructure?

To improve the reliability of key components that OpenStack depends on, such as the MQ and the database, we developed our own system that implements MQ/DB fault monitoring and automated recovery. By modifying OpenStack's code, we support innovations such as hot upgrades of virtual machine CPU and memory, prioritizing local resize of virtual machines, and hardware encryption dongle ("encryption dog") support.

On the basis of open source OVS and OpenStack Neutron, the virtual network architecture adds key functions missing from the open source systems, such as network ACLs, VPC peering connections, hybrid VXLAN interconnection and EIP clusters.
Inspur InCloud OpenStack 5.6 has completed a test with a single cluster of up to 500 nodes, which is currently the largest single-cluster test based on OpenStack Rocky in the world.

Each community member can rate the nominees once by October 8 at 11:59 p.m. Pacific Daylight Time.

The post Shanghai Superuser Award Nominee: InCloud OpenStack Team appeared first on Superuser.

by Superuser at October 01, 2019 08:12 PM

Shanghai Superuser Award Nominee – Rakuten Mobile Network Organization

It’s time for the community to help determine the winner of the Open Infrastructure Summit Shanghai Superuser Awards. The Superuser Editorial Advisory Board will review the nominees and determine the finalists and overall winner after the community has had a chance to review and rate nominees.

Now, it’s your turn.

The Rakuten Mobile Network Organization team is one of five nominees for the Superuser Awards. Review the nomination criteria below, check out the other nominees and rate the nominees before the deadline October 8 at 11:59 p.m. Pacific Daylight Time.

Rate them here!

Who is the nominee?

Rakuten Mobile Network Organization, whose team consists of 100+ members.

Core leaders of the team:

Tareq Amin, Ashiq Khan, Ryota Mibu, Yusuke Takano, Masaaki Kosugi, Yuichi Koike, Yuka Takeshita, Rahul Atri, Shinya Kita, Vineet Singh, Mohamed Aslam, Jun Okada, Sharad Sriwastawa, Sushil Rawat, Michael Treasure.

How has open infrastructure transformed the organization’s business?

In June 2018, Rakuten Inc., Japan, launched a new initiative to enter the highly competitive mobile market in Japan as the country's fourth Mobile Network Operator (MNO), so that it can own the entire customer experience over its network. Rakuten decided to push the cloud technology boundary to its limits and has adopted a cloud-based architecture built on OpenStack and Kubernetes for its mobile network. In pursuit of a fully automated, highly efficient, cost-optimized solution, Rakuten has chosen to run its entire cloud infrastructure on commercial, off-the-shelf (COTS) x86 servers, powered by Cisco Virtualized Infrastructure Manager (CVIM), an OpenStack-based NFV platform.

Open source technology has made this a reality in an extremely short timeframe.

How has the organization participated in or contributed to an open source project?

Rakuten is an active user of OpenStack technology. In this regard, they have pushed Cisco and Red Hat to backport features such as trusted_vf and the Cinder multi-attach feature for the RBD backend to Queens. Also, since the entire network is IPv6, they are key proponents of getting IPv6 working in Kubernetes.

What open source technologies does the organization use in its open infrastructure environment?

Rakuten uses CVIM, Cisco’s OpenStack infrastructure manager designed for use in highly distributed telco network environments. Rakuten is also using Kubernetes for their container workload, which is hosted on CVIM as well. Cisco VIM is composed of many open source components along with OpenStack, such as Prometheus, Telegraf, Grafana (for monitoring), Elasticsearch, fluentd, and Kibana (for logging), and a variety of deployment and automation tools. The OPNFV toolsets, VMTP and NFVBench, are integrated with CVIM’s OpenStack deployment to prove out networking functionality and performance requirements, key to delivering telco-grade SLAs. Ceph is used to provide fault-tolerant storage.

What is the scale of your open infrastructure environment?

The Rakuten mobile network, at full scale, will consist of several thousand clouds, each of which will run VNFs and CNFs that are critical to the mobile world. All of these clouds are currently based on the Queens release of OpenStack and are orchestrated by CVIM (OpenStack Queens). Some of the clouds also run the VNFMs, OSS/BSS systems and a private cloud for customer multimedia data storage and sharing.

The overall plan is to deploy several thousand clouds running vRAN workloads spread across all of Japan, targeting 5 million mobile phone users. These are small edge clouds that run mobile-radio-specific workloads and need to be distributed across the country to be close to the antennas they control.

The deployment includes 135K cores, with a target of using up to a million cores when done.

What kind of operational challenges have you overcome during your experience with open infrastructure?

The main challenge associated with this network is the sheer number of clouds in the solution. Planning and operationalizing hardware and software updates and upgrades, rolling out new features, BIOS updates, security compliance, etc., and monitoring all of the clouds centrally with full automation is not only a challenge, but is pushing Rakuten and all of its vendors towards the edge of technology. Also, the cloud runs solely over IPv6, which is a paradigm shift in the industry. In order to meet such immense challenges, Rakuten Mobile has developed an operations system (OSS) which performs IP address generation and allocation, VNF instantiation and lifecycle management, and mobile base station commissioning in a fully automated way, to name a few functions.

How is this team innovating with open infrastructure?

Given the size of the solution, automation is the only way forward. Rakuten has invested heavily in automating every operation possible, including cloud installation, updates and reconfiguration over the REST API provided by CVIM, and the cloud has been adapted to handle low-latency workloads. Rakuten has also created a staging lab where all vendors bring in their software; integration testing happens there and, once that passes, the software for every vendor is rolled out. In addition, a CI/CD system has been developed that picks up software from each vendor and rolls it into the test lab for testing to commence.

Rakuten will also be one of the pioneers in offering mobile gaming and low latency applications from its edge data centers using true multi-access edge computing (MEC).

 

Each community member can rate the nominees once by October 8 at 11:59 p.m. Pacific Daylight Time.

The post Shanghai Superuser Award Nominee – Rakuten Mobile Network Organization appeared first on Superuser.

by Superuser at October 01, 2019 08:12 PM

Aptira

Real World Open Networking. Part 1 – Unpacking Openness


In our last post in this series, we completed our coverage of the third domain of Open Networking, that of Open Network Integration. To wrap up our overall Open Networking series, we begin a set of posts that address the practical reality of delivering Open Network solutions in the current marketplace, and particularly focusing on Interoperability.

Our first topic describes the attributes of an Open Network Solution.

What is an Open Network Solution?

Aptira describes Open Networking as the alignment of technology capabilities into a holistic practice that designs, builds and operates a solution. Successfully implementing and operating a network needs a lot more than just technology and practices.

A mandatory aspect of Open Network solutions, of course, is that they are actually “open”. But in the industry this term can be somewhat vague and has many meanings depending on who you talk to. 

As part of the description and definition of Open Networks in this series of posts, Aptira puts forward our definition of “openness”, and describes how this definition can be practically applied to the solutions that we build. 

Openness – Key attributes of an Open Network Solution

Rather than completely “re-invent the wheel”, we’ve adapted and enhanced one of many existing models of Open Networking, that of Aruba Networks. Aptira’s extended model of the key attributes of an Open Network solution consists of five attributes that are essential for solution constructors to leverage existing components in a solution:

  • Open Standards – standards (both “de facto” and “de jure”) are prominent in the Open Networking domain. These include the protocols that devices use to interact within a network or externally across other networks;
  • Open APIs – the ability to “program your infrastructure as code” based on a set of documented, supported and robust APIs that allow solution constructors to use “composable infrastructure” – to some extent, Open APIs also fall under the standards capability, but we will discuss this in more detail;
  • Open (Partner) Ecosystem – enables rapid and frictionless onboarding of new infrastructure into solutions. Aptira strongly believes that an Open Partner Ecosystem must include the principle of “Open Skills” – what we call “teaching our customers to fish” – to ensure that customers do not remain dependent on vendor resources but build skills internally or at least can select alternative sources for required skillsets;
  • Open Source – the power and value of open source component availability is key, and we’ve examined this link.
  • Open Operations – the ability of an operator to rapidly develop, deploy and operate complex integrated solutions sustainably over time, with high integrity and high flexibility. Since Aptira has added this to the original model, we'll explain this in more detail below.

In the current marketplace, each of these attributes has practical implementation challenges, but they are all required if a network solution is to be considered truly open.

Open Operations

Aptira added “Open Operations” to the original model, as well as making some other adaptations.

Open Operations is as much related to the operator's organisational capabilities and skillsets as it is to the technology and partner considerations in the other four pillars, but it is a critical success factor if an operator is to capture the promised value of the Open Networking proposition. It extends to the entire solution lifecycle, including the development phase, and is critical in reducing or eliminating the gaps and organisational transitions that occur in the rollout of capability from the development stage of the lifecycle into the production operations stages.

Open Operations is partly, and most importantly, fulfilled by DevOps, but it is more than that. Open Operations is tightly intertwined with the Open Partner Ecosystem to enable third-party participation in operational processes, as required to meet business objectives.

Why do we need Openness?

The holy grail of “openness” is Interoperability, or the ability to interchange components at will.

Whilst this term originally applied to just technical components (network equipment and software), in Open Network solutions it has a broader scope, covering technology, skills and processes amongst other things. Essentially, interoperability means no lock-in, either to a particular vendor or a particular technology choice or product line.

Given that Open Networking solutions are “multi-everything”: multi-vendor, multi-technology, multi-location, multi-product etc, this objective is very important.

We will expand on Interoperability in the next post.

Stay tuned.


The post Real World Open Networking. Part 1 – Unpacking Openness appeared first on Aptira.

by Adam Russell at October 01, 2019 02:58 AM

September 30, 2019

OpenStack Superuser

Review of Pod-to-Pod Communications in Kubernetes

Nowadays, Kubernetes has changed the way software development is done. As a portable, extensible, open-source platform for managing containerized workloads and services that facilitates both declarative configuration and automation, Kubernetes has proven itself to be a dominant player for managing complex microservices. Its popularity stems from the fact that Kubernetes meets the following needs: businesses want to grow and pay less, DevOps want a stable platform that can run applications at scale, developers want reliable and reproducible flows to write, test and debug code. Here is a good article to learn more about Kubernetes evolution and architecture.

One of the important areas of managing a Kubernetes network is forwarding container ports internally and externally to make sure containers and Pods can communicate with one another properly. To manage such communications, Kubernetes offers the following four networking models:

  • Container-to-Container communications
  • Pod-to-Pod communications
  • Pod-to-Service communications
  • External-to-internal communications

In this article, we dive into Pod-to-Pod communications by showing you ways in which Pods within a Kubernetes network can communicate with one another.

While Kubernetes is opinionated about how containers are deployed and operated, it is very non-prescriptive about how the network in which Pods run should be designed. Kubernetes imposes the following fundamental requirements on any networking implementation (barring any intentional network segmentation policies):

  • All pods can communicate with all other pods without NAT
  • All nodes running pods can communicate with all pods (and vice-versa) without NAT
  • The IP that a pod sees itself as is the same IP that other pods see it as

To illustrate these requirements, let us use a cluster with two nodes. The nodes are in subnet 192.168.1.0/24 and the Pods use the 10.1.0.0/16 subnet, with 10.1.1.0/24 and 10.1.2.0/24 used by node1 and node2 respectively for the Pod IPs.

So, from the above Kubernetes requirements, the following communication paths must be established by the network.

  • Nodes should be able to talk to all pods. For example, 192.168.1.100 should be able to reach 10.1.1.2, 10.1.1.3, 10.1.2.2 and 10.1.2.3 directly (without NAT)
  • A Pod should be able to communicate with all nodes. For example, Pod 10.1.1.2 should be able to reach 192.168.1.100 and 192.168.1.101 without NAT
  • A Pod should be able to communicate with all Pods. For example, 10.1.1.2 should be able to communicate with 10.1.1.3, 10.1.2.2 and 10.1.2.3 directly (without NAT)

While exploring these requirements, we will lay the foundation for how the services are discovered and exposed. There can be multiple ways to design the network that meets Kubernetes networking requirements with varying degrees of complexity and flexibility.

Pod-to-Pod Networking and Connectivity

Kubernetes does not orchestrate setting up the network itself; it offloads the job to CNI plug-ins. Here is more info for the CNI plugin installation. Below are possible network implementation options through CNI plugins which permit Pod-to-Pod communication while honoring the Kubernetes requirements:

  1. Layer 2 (switching) solution
  2. Layer 3 (routing) solution
  3. Overlay solutions


I- Layer 2 Solution

This is the simplest approach and should work well for small deployments. Pods and nodes should see the subnet used for Pod IPs as a single L2 domain. Pod-to-Pod communication (on the same host or across hosts) happens through ARP and L2 switching. We could use the bridge CNI plug-in to reuse an L2 bridge for pod containers with the configuration below on node1 (note the /16 subnet).

 

{
  "name": "mynet",
  "type": "bridge",
  "bridge": "kube-bridge",
  "isDefaultGateway": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.1.0.0/16"
  }
}

 

kube-bridge needs to be pre-created such that ARP packets go out on the physical interface. To achieve that, we have another bridge with the physical interface connected to it and the node IP assigned to it, to which kube-bridge is hooked through a veth pair, as shown below.

We can pass a pre-created bridge, in which case the bridge CNI plugin will reuse it.
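A rough sketch of that wiring with iproute2 commands follows; the interface names and the node address are assumptions for illustration, and a real deployment would normally persist this in the host's network configuration.

# Uplink bridge carrying the node IP, with the physical NIC attached.
ip link add br0 type bridge
ip link set eth0 master br0
ip addr add 192.168.1.100/24 dev br0

# Pod bridge that the bridge CNI plugin will reuse.
ip link add kube-bridge type bridge

# veth pair hooking kube-bridge to br0 so ARP reaches the physical network.
ip link add veth-kube type veth peer name veth-phys
ip link set veth-kube master kube-bridge
ip link set veth-phys master br0

# Bring everything up.
ip link set br0 up
ip link set kube-bridge up
ip link set veth-kube up
ip link set veth-phys up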

II- Layer 3 Solutions

A more scalable approach is to use node routing rather than switching the traffic to the Pods. We could use the bridge CNI plug-in to create a bridge for Pod containers with a gateway configured. For example, on node1 the configuration below can be used (note the /24 subnet).

{
  "name": "mynet",
  "type": "bridge",
  "bridge": "kube-bridge",
  "isDefaultGateway": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.1.1.0/24"
  }
}

So how does Pod1 with IP 10.1.1.2 running on node1 communicate with Pod3 with IP 10.1.2.2 running on node2? We need a way for nodes to route the traffic to other node Pod subnets.

We could populate the default gateway router with routes for the subnets as shown in the diagram below. Routes to 10.1.1.0/24 and 10.1.2.0/24 are configured to go through node1 and node2 respectively. We could automate keeping the route tables updated as nodes are added to or deleted from the cluster. We can also use some of the container networking solutions which can do the job on public clouds, for example Flannel's backend for AWS and GCE, Weave's AWS-VPC mode, etc.

Alternatively, each node can be populated with routes to the other subnets as shown in the diagram below. Again, updating the routes can be automated in a small or static environment as nodes are added to or deleted from the cluster, or container networking solutions like Calico or the Flannel host-gateway backend can be used.
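As a concrete illustration of the second option, the static routes on each node could be added with iproute2 commands along these lines (addresses are taken from the example above; in practice a CNI plugin or routing daemon would manage them):

# On node1 (192.168.1.100): send traffic for node2's Pod subnet via node2.
ip route add 10.1.2.0/24 via 192.168.1.101

# On node2 (192.168.1.101): send traffic for node1's Pod subnet via node1.
ip route add 10.1.1.0/24 via 192.168.1.100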

III- Overlay Solutions

Unless there is a specific reason to use an overlay solution, it generally does not make sense, considering the networking model of Kubernetes and its lack of support for multiple networks. Kubernetes requires that nodes should be able to reach each Pod, even though Pods are in an overlay network. Similarly, Pods should be able to reach any node as well. We will need host routes set in the nodes such that Pods and nodes can talk to each other. Since inter-host Pod-to-Pod traffic should not be visible in the underlay, we need a virtual/logical network that is overlaid on the underlay. Pod-to-Pod traffic would need to be encapsulated at the source node. The encapsulated packet is then forwarded to the destination node where it is de-encapsulated. A solution can be built around any existing Linux encapsulation mechanism. We need to have a tunnel interface (with VXLAN, GRE, etc. encapsulation) and a host route such that inter-node Pod-to-Pod traffic is routed through the tunnel interface. Below is a very generalized view of how an overlay solution can be built that can meet Kubernetes network requirements. Unlike the two previous solutions, there is significant effort in the overlay approach in setting up tunnels, populating FDBs, etc. Existing container networking solutions like Weave and Flannel can be used to set up a Kubernetes deployment with overlay networks. Here is a good article for reading more on similar Kubernetes topics.
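To make the overlay idea a little more tangible, here is a minimal hand-rolled VXLAN sketch for node1, assuming the same addressing as above (the interface names, VXLAN ID and all-zeros FDB entry are illustrative assumptions; real solutions such as Flannel automate this):

# Create a VXLAN tunnel interface on top of the physical NIC.
ip link add vxlan0 type vxlan id 42 dev eth0 dstport 4789 local 192.168.1.100
ip link set vxlan0 up

# Route traffic for node2's Pod subnet through the tunnel interface.
ip route add 10.1.2.0/24 dev vxlan0 onlink

# Point the forwarding database at node2's underlay address so encapsulated
# packets are sent to the right VTEP.
bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 192.168.1.101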

 

Conclusion

In this article, we covered how cross-node Pod-to-Pod networking works and laid the groundwork for how services are exposed within the cluster to the Pods, and externally. What makes Kubernetes networking interesting is how the design of core concepts like services, network policy, etc. permits several possible implementations. Though some core components and add-ons provide default implementations, they are replaceable. There is a whole ecosystem of network solutions that plug neatly into the Kubernetes networking semantics. Now that you have learned how Pods inside a Kubernetes system can communicate and exchange data, you can move on to learn other Kubernetes networking models such as Container-to-Container or Pod-to-Service communications. Here is a good article for learning more advanced topics on Kubernetes development.

 

About the Author

This article is written by Matt Zand who is the founder of High School Technology Services, DC Web Makers and Coding Bootcamps.  He has written extensively on advance topics on web design, mobile App development and blockchain. He is a senior editor at Touchstone Words where he writes and reviews coding and technology articles. He is also senior instructor and developer living in Washington DC. You can follow him on Linkedin.

Photo // CC BY NC

The post Review of Pod-to-Pod Communications in Kubernetes appeared first on Superuser.

by Matt Zand at September 30, 2019 02:00 PM

Galera Cluster by Codership

EverData reports Galera Cluster outshines Amazon Aurora and RDS

EverData, a leading data center and cloud solution provider in India, has recently been writing quite a bit about Galera Cluster, and it seems like we should highlight them. For one, they've talked about streaming replication in Galera Cluster 4, available in MariaDB Server 10.4. However, let us focus on their post: Galera Cluster vs Amazon RDS: A Comprehensive Review.

They compared MySQL with MHA, MySQL with Galera Cluster, and Amazon Web Services (AWS) Relational Database Service (RDS). Their evaluation criteria were how quickly each setup recovered after a crash, as well as performance while managing concurrent reads and writes.

In their tests, they found that failover time with Galera Cluster was between 8 and 10 seconds, whereas Aurora (so they were not using Amazon RDS?) took between 15 and 51 seconds, and MySQL with MHA took 140 seconds (which seems excessively high, considering this solution has been known to do sub-10-second failovers, so perhaps some configuration tuning was needed?).

In terms of performance, MySQL with MHA comes out the winner over Galera Cluster, but Galera Cluster comes out ahead of RDS (is this RDS MySQL or Aurora?). This can be explained simply by the fact that MySQL with MHA was likely configured with asynchronous replication rather than semi-synchronous, and that you're only writing to one node as opposed to three nodes. (MHA setups recommend the use of semi-synchronous replication; also, later in the report there is a note about replication lag.)

Some takeaways in conclusion, straight from their report:

  • “If HA and low failover time are the major factors, then MySQL with Galera is the right choice.”
  • High Availability: “MySQL/Galera was more efficient and consistent, but the RDS didn’t justify the episodes of replication lags.”
  • Performance: “MySQL/Galera outperformed the RDS in all tests — by the execution time, number of transactions, and rows managed.”

“we can see that MySQL/Galera manages the commit phase part more efficiently along with the replication and data validation.” Source: EverData

Looks extremely positive from a Galera Cluster standpoint, and we thank the team at EverData for such a report. It is an interesting read, so take a look at Galera Cluster vs Amazon RDS: A Comprehensive Review.

by Sakari Keskitalo at September 30, 2019 12:20 PM

September 28, 2019

Christopher Smart

Using pipefail with shell module in Ansible

If you’re using the shell module with Ansible and piping the output to another command, it might be a good idea to set pipefail. This way, if the first command fails, the whole task will fail.

For example, let's say we're running this silly task to look at the /tmp directory and then trim the characters "tmp" from the result.

ansible all -i "localhost," -m shell -a \
'ls -ld /tmp | tr -d tmp'

This will return something like this, with a successful return code.

localhost | CHANGED | rc=0 >>
drwxrwxrw. 26 roo roo 640 Se 28 19:08 /

But, let’s say the directory doesn’t exist, what would the result be?

ansible all -i "localhost," -m shell -a \
'ls -ld /tmpnothere | tr -d tmp'

Still success, because the pipe to tr was successful, even though we can see the ls command failed.

localhost | CHANGED | rc=0 >>
ls: cannot access ‘/tmpnothere’: No such file or directory

This time, let’s set pipefail first.

ansible all -i "localhost," -m shell -a \
'set -o pipefail && ls -ld /tmpnothere | tr -d tmp'

This time it fails, as expected.

localhost | FAILED | rc=2 >>
ls: cannot access ‘/tmpnothere’: No such file or directorynon-zero return code

If /bin/sh on the remote node does not point to bash then you’ll need to pass in an argument specifying bash as the executable to use for the shell task.

  - name: Silly task
    shell: set -o pipefail && ls -ld /tmp | tr -d tmp
    args:
      executable: /usr/bin/bash

Ansible lint will pick these things up for you, so why not run it across your code 😉

by Chris at September 28, 2019 10:43 AM

September 27, 2019

Chris Dent

Placement Update 19-∞

Let's call this placement update 19-∞, as this will be my last one. It's been my pleasure to provide this service for nearly three years. I hope it has been as useful to others as it has been for me. The goal all along was to provide some stigmergic structures to augment how we, the placement team, collaborated.

I guess it worked: yesterday we released a candidate that will likely become the Train version of placement. The first version where the only placement you can get is from its own repo/project. Thanks to everyone who has made this possible over the years.

Most Important

Tetsuro will be the next placement PTL. He and Gibi are working on the project update and other Summit/PTG-related activities for Shanghai. If you have thoughts on that, please contact them.

The now worklist I made a couple weeks ago has had some progress, but some tasks remain. None of them were critical for the release, except perhaps the ongoing need for better documentation of how to most effectively use the service.

Since we're expecting the Ussuri release to be one where we consolidate how other services make use of placement, most of the items on that list can fit well with that.

We should be on the lookout for bugs reported by people trying out the release candidate(s).

Stories/Bugs

(Numbers in () are the change since the last pupdate.)

There are 21 (-2) stories in the placement group. 0 (0) are untagged. 7 (2) are bugs. 2 (-2) are cleanups. 9 (-1) are rfes. 4 (-1) are docs.

If you're interested in helping out with placement, those stories are good places to look.

osc-placement

There are several osc-placement changes, many of which are related to the cutting of a stable/train branch.

Main Themes

Consumer Types

Adding a type to consumers will allow them to be grouped for various purposes, including quota accounting.

Cleanup

Cleanup is an overarching theme related to improving documentation, performance and the maintainability of the code. The changes we made this cycle are fairly complex to use and were fairly complex to write. We need to make sure, with the coming cycle, that we help people use them well.

Other Placement

Miscellaneous changes can be found in the usual place.

There are two os-traits changes being discussed. And zero os-resource-classes changes.

Other Service Users

Since we're in RC period I'll not bother listing pending changes. During Ussuri the placement team hopes to be available to facilitate other projects using the service. Keeping track of those is what this section has been trying to do. Besides general awareness of what's going on, I've also used this gerrit query to find things that might be related to placement.

End

🙇

by Chris Dent at September 27, 2019 01:32 PM

September 23, 2019

Mirantis

PaaS vs KaaS: What’s the difference, and when does it matter?

Earlier this month I had the pleasure of addressing the issue of Platform as a Service vs Kubernetes as a Service. We talked about the differences between the two modes, and their relative strengths and weaknesses.

by Nick Chase at September 23, 2019 01:49 PM

StackHPC Team Blog

Bespoke Bare Metal: Ironic Deploy Templates

Ironic's mascot, Pixie Boots

Iron is solid and inflexible, right?

OpenStack Ironic's Deploy Templates feature brings us closer to a world where bare metal servers can be automatically configured for their workload.

In this article we discuss the Bespoke Bare Metal (slides) presentation given at the Open Infrastructure summit in Denver in April 2019.

BIOS & RAID

The most requested features driving the deploy templates work are dynamic BIOS and RAID configuration. Let's consider the state of things prior to deploy templates.

Ironic has for a long time supported a feature called cleaning. This is typically used to perform actions to sanitise hardware, but can also perform some one-off configuration tasks. There are two modes - automatic and manual. Automatic cleaning happens when a node is deprovisioned. A typical use case for automatic cleaning is shredding disks to remove sensitive data. Manual cleaning happens on demand, when a node is not in use. The following diagram shows a simplified view of the node states related to cleaning.

Ironic cleaning states (simplified)

Cleaning works by executing a list of clean steps, which map to methods exposed by the Ironic driver in use. Each clean step has the following fields:

  • interface: One of deploy, power, management, bios, raid
  • step: Method (function) name on the driver interface
  • args: Dictionary of keyword arguments
  • priority: Order of execution (higher runs earlier)

BIOS

BIOS configuration support was added in the Rocky cycle. The bios driver interface provides two clean steps:

  • apply_configuration: apply BIOS configuration
  • factory_reset: reset BIOS configuration to factory defaults

Here is an example of a clean step that uses the BIOS driver interface to disable HyperThreading:

{
  "interface": "bios",
  "step": "apply_configuration",
  "args": {
    "settings": [
      {
        "name": "LogicalProc",
        "value": "Disabled"
      }
    ]
  }
}

RAID

Support for RAID configuration was added in the Mitaka cycle. The raid driver interface provides two clean steps:

  • create_configuration: create RAID configuration
  • delete_configuration: delete all RAID virtual disks

The target RAID configuration must be set in a separate API call prior to cleaning.

{
  "interface": "raid",
  "step": "create_configuration",
  "args": {
    "create_root_volume": true,
    "create_nonroot_volumes": true
  }
}
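For reference, the separate call to set the target RAID configuration before cleaning might look something like this with the baremetal CLI (the node name and JSON file are illustrative):

# Set the target RAID configuration on the node; the create_configuration
# clean step will then apply it.
openstack baremetal node set gold-node-1 \
    --target-raid-config raid-config.json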

Of course, support for BIOS and RAID configuration is hardware-dependent.

Limitations

While BIOS and RAID configuration triggered through cleaning can be useful, it has a number of limitations. The configuration is not integrated into Ironic node deployment, so users cannot select a configuration on demand. Cleaning is not available to Nova users, so it is accessible only to administrators. Finally, the requirement for a separate API call to set the target RAID configuration is quite clunky, and prevents the configuration of RAID in automated cleaning.

With these limitations in mind, let's consider the goals for bespoke bare metal.

Goals

We want to allow a pool of hardware to be applied to various tasks, with an optimal server configuration used for each task. Some examples:

  • A Hadoop node with Just a Bunch of Disks (JBOD)
  • A database server with mirrored & striped disks (RAID 10)
  • A High Performance Computing (HPC) compute node, with tuned BIOS parameters

In order to avoid partitioning our hardware, we want to be able to dynamically configure these things when a bare metal instance is deployed.

We also want to make it cloudy. It should not require administrator privileges, and should be abstracted from hardware specifics. The operator should be able to control what can be configured and who can configure it. We'd also like to use existing interfaces and concepts where possible.

Recap: Scheduling in Nova

Understanding the mechanics of deploy templates requires a reasonable knowledge of how scheduling works in Nova with Ironic. The Placement service was added to Nova in the Newton cycle, and extracted into a separate project in Stein. It provides an API for tracking resource inventory & consumption, with support for both quantitative and qualitative aspects.

Let's start by introducing the key concepts in Placement.

  • A Resource Provider provides an Inventory of resources of different Resource Classes
  • A Resource Provider may be tagged with one or more Traits
  • A Consumer may have an Allocation that consumes some of a Resource Provider’s Inventory

Scheduling Virtual Machines

In the case of Virtual Machines, these concepts map as follows:

  • A Compute Node provides an Inventory of vCPU, Disk & Memory resources
  • A Compute Node may be tagged with one or more Traits
  • An Instance may have an Allocation that consumes some of a Compute Node’s Inventory

A hypervisor with 35GB disk, 5825MB RAM and 4 CPUs might have a resource provider inventory record in Placement accessed via GET /resource_providers/{uuid}/inventories that looks like this:

{
    "inventories": {
        "DISK_GB": {
            "allocation_ratio": 1.0, "max_unit": 35, "min_unit": 1,
            "reserved": 0, "step_size": 1, "total": 35
        },
        "MEMORY_MB": {
            "allocation_ratio": 1.5, "max_unit": 5825, "min_unit": 1,
            "reserved": 512, "step_size": 1, "total": 5825
        },
        "VCPU": {
            "allocation_ratio": 16.0, "max_unit": 4, "min_unit": 1,
            "reserved": 0, "step_size": 1, "total": 4
        }
    },
    "resource_provider_generation": 7
}

Note that the inventory tracks all of a hypervisor's resources, whether they are consumed or not. Allocations track what has been consumed by instances.

Scheduling Bare Metal

The scheduling described above for VMs does not apply cleanly to bare metal. Bare metal nodes are indivisible units, and cannot be shared by multiple instances or overcommitted. They're either in use or not. To resolve this issue, we use Placement slightly differently with Nova and Ironic.

  • A Bare Metal Node provides an Inventory of one unit of a custom resource
  • A Bare Metal Node may be tagged with one or more Traits
  • An Instance may have an Allocation that consumes all of a Bare Metal Node’s Inventory

If we now look at the resource provider inventory record for a bare metal node, it might look like this:

{
    "inventories": {
        "CUSTOM_GOLD": {
            "allocation_ratio": 1.0,
            "max_unit": 1,
            "min_unit": 1,
            "reserved": 0,
            "step_size": 1,
            "total": 1
        }
    },
    "resource_provider_generation": 1
}

We have just one unit of one resource class, in this case CUSTOM_GOLD. The resource class comes from the resource_class field of the node in Ironic, upper-cased, and with a prefix of CUSTOM_ to denote that it is a custom resource class as opposed to a standard one like VCPU.
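For context, the resource class is set on the Ironic node itself, for example with the baremetal CLI (the node name here is illustrative, borrowed from the traits example later on):

# Set the resource class; Nova/Placement exposes it as CUSTOM_GOLD.
openstack baremetal node set gold-node-1 --resource-class GOLD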

What sort of Nova flavor would be required to schedule to this node?

openstack flavor show bare-metal-gold -f json \
    -c name -c ram -c properties -c vcpus -c disk
{
  "name": "bare-metal-gold",
  "vcpus": 4,
  "ram": 4096,
  "disk": 1024,
  "properties": "resources:CUSTOM_GOLD='1',
                 resources:DISK_GB='0',
                 resources:MEMORY_MB='0',
                 resources:VCPU='0'"
}

Note that the standard fields (vcpus etc.) may be specified for informational purposes, but should be zeroed out using properties as shown.

Traits

So far we have covered scheduling based on quantitative resources. Placement uses traits to model qualitative resources. These are associated with resource providers. For example, we might query GET /resource_providers/{uuid}/traits for a resource provider that has an FPGA to find some information about the class of the FPGA device.

{
    "resource_provider_generation": 1,
    "traits": [
        "CUSTOM_HW_FPGA_CLASS1",
        "CUSTOM_HW_FPGA_CLASS3"
    ]
}

Ironic nodes can have traits assigned to them, in addition to their resource class: GET /nodes/{uuid}?fields=name,resource_class,traits:

{
  "Name": "gold-node-1",
  "Resource Class": "GOLD",
  "Traits": [
    "CUSTOM_RAID0",
    "CUSTOM_RAID1",
  ]
}
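Traits can be added to a node with the baremetal CLI; for example (node and trait names as above):

# Tag the node with a custom trait so it can be matched by flavors that
# require trait:CUSTOM_RAID0.
openstack baremetal node add trait gold-node-1 CUSTOM_RAID0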

Similarly to quantitative scheduling, traits may be specified via a flavor when creating an instance.

openstack flavor show bare-metal-gold -f json -c name -c properties
{
  "name": "bare-metal-gold",
  "properties": "resources:CUSTOM_GOLD='1',
                 resources:DISK_GB='0',
                 resources:MEMORY_MB='0',
                 resources:VCPU='0',
                 trait:CUSTOM_RAID0='required'"
}

This flavor will select bare metal nodes with a resource_class of CUSTOM_GOLD, and a list of traits including CUSTOM_RAID0.

To allow Ironic to take action based upon the requested traits, the list of required traits is stored in the Ironic node object under the instance_info field.

Ironic deploy steps

The Ironic deploy steps framework was added in the Rocky cycle as a first step towards making the deployment process more flexible. It is based on the clean step model described earlier, and allows drivers to define steps available to be executed during deployment. Here is the simplified state diagram we saw earlier, this time highlighting the deploying state in which deploy steps are executed.

Ironic deployment states (simplified)

Each deploy step has:

  • interface: One of deploy, power, management, bios, raid
  • step: Method (function) name on the driver interface
  • args: Dictionary of keyword arguments
  • priority: Order of execution (higher runs earlier)

Notice that this is the same as for clean steps.

The mega step

In the Rocky cycle, the majority of the deployment process was moved to a single step called deploy on the deploy interface with a priority of 100. This step roughly does the following:

  • power on the node to boot up the agent
  • wait for the agent to boot
  • write the image to disk
  • power off
  • unplug from provisioning networks
  • plug tenant networks
  • set boot mode
  • power on

Drivers can currently add steps before or after this step. The plan is to split this into multiple core steps for more granular control over the deployment process.

Limitations

Deploy steps are static for a given set of driver interfaces, and are currently all out of band - it is not possible to execute steps on the deployment agent. Finally, the mega step limits ordering of the steps.

Ironic deploy templates

The Ironic deploy templates API was added in the Stein cycle and allows deployment templates to be registered which have:

  • a name, which must be a valid trait
  • a list of deployment steps

For example, a deploy template could be registered via POST /v1/deploy_templates:

{
    "name": "CUSTOM_HYPERTHREADING_ON",
    "steps": [
        {
            "interface": "bios",
            "step": "apply_configuration",
            "args": {
                "settings": [
                    {
                        "name": "LogicalProc",
                        "value": "Enabled"
                    }
                ]
            },
            "priority": 150
        }
    ]
}

This template has a name of CUSTOM_HYPERTHREADING_ON (which is also a valid trait name), and references a deploy step on the bios interface that sets the LogicalProc BIOS setting to Enabled in order to enable Hyperthreading on a node.

Tomorrow’s RAID

In the Stein release we have the deploy templates and steps frameworks, but lack drivers with deploy step implementations to make this useful. As part of the demo for the Bespoke Bare Metal talk, we built and demoed a proof of concept deploy step for configuring RAID during deployment on Dell machines. This code has been polished and is working its way upstream at the time of writing, and has also influenced deploy steps for the HP iLO driver. Thanks to Shivanand Tendulker for extracting and polishing some of the code from the PoC.

We now have an apply_configuration deploy step available on the RAID interface which accepts RAID configuration as an argument, to avoid the separate API call required in cleaning.

The first pass at implementing this in the iDRAC driver took over 30 minutes to complete deployment. This was streamlined to just over 10 minutes by combining deletion and creation of virtual disks into a single deploy step, and avoiding an unnecessary reboot.

End to end flow

Now we know what a deploy template looks like, how are they used?

First of all, the cloud operator creates deploy templates via the Ironic API to execute deploy steps for allowed actions. In this example, we have a deploy template used to create a 42GB RAID1 virtual disk.

cat << EOF > raid1-steps.json
[
    {
        "interface": "raid",
        "step": "apply_configuration",
        "args": {
            "raid_config": {
                "logical_disks": [
                    {
                        "raid_level": "1",
                        "size_gb": 42,
                        "is_root_volume": true
                    }
                ]
            }
        },
        "priority": 150
    }
]
EOF

openstack baremetal deploy template create \
    CUSTOM_RAID1 \
    --steps raid1-steps.json

Next, the operator creates Nova flavors or Glance images with required traits that reference the names of deploy templates.

openstack flavor create raid1 \
    --property resources:VCPU=0 \
    --property resources:MEMORY_MB=0 \
    --property resources:DISK_GB=0 \
    --property resources:CUSTOM_COMPUTE=1 \
    --property trait:CUSTOM_RAID1=required
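
Required traits can alternatively be attached to a Glance image rather than a flavor; a hedged sketch, reusing the centos7 image from the example below:

openstack image set centos7 \
    --property trait:CUSTOM_RAID1=required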

Finally, a user creates a bare metal instance using one of these flavors that is accessible to them.

openstack server create \
    --name test \
    --flavor raid1 \
    --image centos7 \
    --network mynet \
    --key-name mykey

What happens? A bare metal node is scheduled by Nova which has all of the required traits from the flavor and/or image. Those traits are then used by Ironic to find deploy templates with matching names, and the deploy steps from those templates are executed in addition to the core step, in an order determined by their priorities. In this case, the RAID apply_configuration deploy step runs before the core step because it has a higher priority.

Future Challenges

There is still work to be done to improve the flexibility of bare metal deployment. We need to split out the mega step. We need to support executing steps in the agent running on the node, which would enable deployment-time use of the software RAID support recently developed by Arne Wiebalck from CERN.

Drivers need to expose more deploy steps for BIOS, RAID and other functions. We should agree on how to handle executing a step multiple times, and all the tricky corner cases involved.

We have discussed the Nova use case here, but we could also make use of deploy steps in standalone mode, by passing a list of steps to execute to the Ironic provision API call, similar to manual cleaning. There is also a spec proposed by Madhuri Kumari which would allow reconfiguring active nodes to do things like tweak BIOS settings without requiring redeployment.

Thanks to everyone who has been involved in designing, developing and reviewing the series of features in Nova and Ironic that got us this far. In particular John Garbutt who proposed the specs for deploy steps and deploy templates, and Ruby Loo who implemented the deploy steps framework.

by Mark Goddard at September 23, 2019 11:00 AM

September 19, 2019

OpenStack Superuser

Unleashing the Open Infrastructure Potentials at OpenInfra Days Vietnam 2019

Hosted in Hanoi and organized by the Vietnam OpenInfra User Group (VOI), the Vietnam Internet Association (VIA), and VFOSSA, the second Vietnam OpenInfra Days exceeded expectations, selling out in two weeks and attracting an influx of sponsorship offers until one week before the event. Broadening its focus to open infrastructure, the event drew 300 people to the morning sessions and 500 to the afternoon (open) sessions. Attendees represented more than 90 companies, including telcos, cloud providers, and mobile application providers, who have been applying open source technologies to run their cloud infrastructure and are seeking to unleash its potential for greater flexibility, efficiency, and ease of management.

VOID 2019 morning session and exhibition booths

Structured around container technologies, automation, and security, the agenda featured 25 sessions, including case studies, demos, and tutorials. The speakers, a mix of solution architects, software architects, and DevOps engineers, shared their experiences and best practices from building and running their customers' (and their own) infrastructure using OpenStack, Kubernetes, CI/CD, and more. Many lively discussions carried over into the breaks and the gala dinner, showing the immense interest in open infrastructure.

“The event, in general, is a playground for open source developers, particularly in open infrastructure. In addition, through the event we would like to bring real case studies which are happening in the world to Vietnam so that companies in Vietnam who have been applying open source can see the general trend of the world, as well as make them more confident in open source-based product orientation,” said Tuan Huu Luong, one of the founders of OpenInfra User Group Vietnam, in an interview with VTC1, a national broadcaster.

Local news coverage of the Vietnam OpenInfra Days 2019

Though officially an OSF User Group meetup, the Vietnam OpenInfra Day (VOID) is the largest event on cloud computing and ICT infrastructure in Vietnam. This year's second edition also showed the reach of the Vietnam OpenInfra community in the region, with sponsors from Korea, Japan, Singapore, and Taiwan, and half of the speakers coming from abroad. Accordingly, the organizing team put together a rich program for speakers, sponsors, and attendees.

A warm welcome for speakers and sponsors was organized at the pre-event party in a local brewery, where opinions on open infrastructure and its trends were exchanged. A five-star lunch buffet at the event venue, the InterContinental Hanoi, provided a pleasant occasion for attendees to meet and network. Finally, the gala dinner in an authentic Vietnamese restaurant offered a chance to wrap up the OpenInfra discussions and introduce international friends to Vietnamese food culture, and of course the noisy drinking culture: "Uong Bia Di, 1-2-3 Zooo!" The photo gallery of the event can be found here.

The Vietnam OpenInfra team was impressed by, and thankful for, the large turnout from the Korea OpenInfra User Group, even though plans to co-organize the event had fallen through for lack of time. A plan to co-organize a Korea OpenInfra meetup was worked out during the event instead, and the Korean attendees clearly enjoyed themselves.

Korea OpenInfra User Group at the VOID 2019

Last but not least, the success of the event owes much to the constant support of the OpenStack Foundation (OSF), which was a silver sponsor this year, and especially to the participation of OSF members in organizing the OpenStack Upstream Institute training in Hanoi after the main event. Ildiko Vancsa, Kendall Nelson, and volunteer trainers from the Vietnam and Korea User Groups delivered a surprisingly fun and productive training day to a new generation of contributors from Vietnam.

OpenStack Upstream Institute Training Hanoi

Time to say goodbye to VOID 2019. See you again at the next VOID; until then, we will celebrate the open infrastructure community's achievements with a series of events, starting with the Korea meetup in October (TBD)!

VOID 2019 (left) and OpenStack Upstream Institute (right) organizing teams

The post Unleashing the Open Infrastructure Potentials at OpenInfra Days Vietnam 2019 appeared first on Superuser.

by Trinh Nguyen at September 19, 2019 01:00 AM

September 18, 2019

Adam Spiers

Improving trust in the cloud with OpenStack and AMD SEV

This post contains an exciting announcement, but first I need to provide some context!

Ever heard that joke “the cloud is just someone else’s computer”?

Coffee mug saying "There is no cloud. It's just someone else's computer"

Of course it’s a gross over-simplification, but there’s more than a grain of truth in it. And that raises the question: if your applications are running in someone else’s data-centre, how can you trust that they’re not being snooped upon, or worse, invasively tampered with?

Until recently, the answer was “you can’t”. Well, that’s another over-simplification. You could design your workload to be tamperproof; for example even if individual mining nodes in Bitcoin or Ethereum are compromised, the blockchain as a whole will resist the attack just fine. But there’s still the snooping problem.

Hardware to the rescue?

However, there’s some good news on this front. Intel and AMD realised this was a problem, and have both introduced new hardware capabilities to help improve the level to which cloud users can trust the environment in which their workloads are executed, e.g.:

  • AMD SEV (Secure Encrypted Virtualization), which can encrypt the memory of a running VM with a key which is only accessible to the owner of that VM. This is done on-chip, so that even with physical access to the machine it is a lot harder to snoop on the running VM [1].

    It can also provide the guest owner with an attestation which cryptographically proves that the memory was encrypted correctly and can only be decrypted by the owner.

  • Intel MKTME (Multi-Key Total Memory Encryption), which is a similar approach.

But even with that hardware support, there is the question to what degree anyone can trust public clouds run on proprietary technology. There is a growing awareness that Free (Libre) / Open Source Software tends to be inherently more secure and trustworthy, since its transparency enables unlimited peer review, and its openness allows anyone to contribute improvements.

And these days, OpenStack is pretty much the undisputed king of the Open Source cloud infrastructure world.

An exciting announcement

So I’m delighted to be able to announce a significant step forward in trustworthy cloud computing: as of this week, OpenStack is now able to launch VMs with SEV enabled! (Given the appropriate AMD hardware, of course.)

The new hw:mem_encryption flavor extra spec

The core functionality is all merged and will be in the imminent Train release. You can read the documentation, and you will also find it mentioned in the Nova Release Notes.

While this is “only” an MVP and far from the end of the journey (see below), it’s an important milestone in a strong partnership between my employer SUSE and AMD. We started work on adding SEV support into OpenStack around a year ago:

The original blueprint for integrating AMD SEV into nova

This resulted in one of the most in-depth technical specification documents I’ve ever had to write, plus many months of intense collaboration on the code and several changes in design along the way.

SEV code reviews in Gerrit

I’d like to thank not only my colleagues at SUSE and AMD for all their work so far, but also many members of the upstream OpenStack community, especially the Nova team. In particular I enjoyed fantastic support from the PTL (Project Technical Lead) Eric Fried, and several developers at Red Hat, which I think speaks volumes to how well the “coopetition” model works in the Open Source world.

The rest of this post gives a quick tour of the implementation via screenshots and brief explanations, and then concludes with what’s planned next.

OpenStack’s Compute service (nova) will automatically detect the presence of the SEV feature on any compute node which is configured to support it. You can optionally configure how many slots are available on the memory controller for encryption keys. One is used for each guest, so this effectively acts as the maximum number of guest VMs which can concurrently use SEV. Here you can see the configuration of this option, and how nova handles the inventory. Note that it also registers an SEV trait on the compute host, so that in the future if the cloud has a mix of hardware offering different guest memory encryption technologies, you’ll be able to choose which one you want for any given guest, if you need to.

Inventorying the SEV feature.
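
The trait registered on SEV-capable hosts is taken from os-traits (HW_CPU_X86_AMD_SEV). As a hedged sketch, an operator could pin a flavor to SEV-capable hosts explicitly via a required trait, independently of the memory encryption switch described next (the flavor name is illustrative):

$ openstack flavor set m1.small \
    --property trait:HW_CPU_X86_AMD_SEV=required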

SEV can be enabled by the operator by adding a new hw:mem_encryption “extra spec” which is a property on nova’s flavors. As already shown in the screenshot above, this can be done through Horizon, OpenStack’s web dashboard. However it can also be set per-image via a similarly-named property hw_mem_encryption:

Enabling SEV via image property in Horizon.

and of course this can all be done via the command-line too:

Enabling SEV via the CLI.
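
For readers without the screenshot to hand, here is a minimal sketch of roughly equivalent commands; the flavor and image names are illustrative, either the flavor extra spec or the image property is sufficient on its own, and the UEFI firmware and q35 machine type properties follow the guidance in the documentation:

$ openstack flavor set m1.small-sev \
    --property hw:mem_encryption=true

$ openstack image set sles15-sev \
    --property hw_mem_encryption=true \
    --property hw_firmware_type=uefi \
    --property hw_machine_type=q35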

Notice the presence of a few other image properties which are crucial for SEV to function correctly. (These are explained fully in the documentation.)

Once booted, an SEV VM instance looks and behaves pretty much like any other OpenStack VM:

SEV instances listed in Horizon

However there are some limitations, e.g. it cannot yet be live-migrated or suspended:

Error shown when attempting to live migrate or suspend an SEV instance

Behind the scenes, nova takes care of quite a few important details in how the VM is configured in libvirt. Firstly it performs sanity checks on the flavor and image properties. Then it adds a crucial new <launchSecurity> element:

The new <launchSecurity> element in the libvirt domain XML

and also enables IOMMU for virtio devices:

Enabling IOMMU for virtio devices
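
If you want to see the result for yourself, one way (a hedged sketch; the instance name and element values are illustrative and vary by hardware and release) is to dump the generated domain XML on the compute host:

$ virsh dumpxml instance-0000002f | grep -A 4 launchSecurity
  <launchSecurity type='sev'>
    <cbitpos>47</cbitpos>
    <reducedPhysBits>1</reducedPhysBits>
    <policy>0x0003</policy>
  </launchSecurity>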

What’s next?

This area of technology is new and rapidly evolving, so there is still plenty of work left to be done, especially on the software side.

Of course we’ll be adding this functionality to SUSE OpenStack Cloud, initially as a technical preview for our customers to try out.

Probably the most important feature needed next on the SEV side is the ability to verify the attestation which cryptographically proves that the memory was encrypted correctly and can only be decrypted by the owner. In addition, specification of the work required to add support for Intel’s MKTME to OpenStack has already started, so I would expect that to continue.

Footnotes:

[1] There are still potential attacks, e.g. snooping the unencrypted memory cache or CPU registers. Work by AMD and others is ongoing to address these.


The post Improving trust in the cloud with OpenStack and AMD SEV appeared first on Structured Procrastination.

by Adam at September 18, 2019 11:32 AM

September 17, 2019

StackHPC Team Blog

Migrating a running OpenStack to containerisation with Kolla

Deploying OpenStack infrastructures with containers brings many operational benefits, such as isolation of dependencies and repeatability of deployment, in particular when coupled with a CI/CD approach. The Kolla project provides tooling that helps deploy and operate containerised OpenStack deployments. Configuring a new OpenStack cloud with Kolla containers is well documented and can benefit from the sane defaults provided by the highly opinionated Kolla Ansible subproject. However, migrating existing OpenStack deployments to Kolla containers can require a more ad hoc approach, particularly to minimise impact on end users.

We recently helped an organization migrate an existing OpenStack Queens production deployment to a containerised solution using Kolla and Kayobe, a subproject designed to simplify the provisioning and configuration of bare-metal nodes. This blog post describes the migration strategy we adopted in order to reduce impact on end users and shares what we learned in the process.

Existing OpenStack deployment

The existing cloud was running the OpenStack Queens release deployed using CentOS RPM packages. This cloud was managed by a control plane of 16 nodes, with each service deployed over two (for OpenStack services) or three (for Galera and RabbitMQ) servers for high availability. Around 40 hypervisor nodes from different generations of hardware were available, resulting in a heterogeneous mix of CPU models, amounts of RAM, and even network interface names (with some nodes using onboard Ethernet interfaces and others using PCI cards).

A separate Ceph cluster was used as a backend for all OpenStack services requiring large amounts of storage: Glance, Cinder, Gnocchi, and also disks of Nova instances (i.e. none of the user data was stored on hypervisors).

A new infrastructure

With a purchase of new control plane hardware also being planned, we advised the following configuration, based on our experience and recommendations from Kolla Ansible:

  • three controller nodes hosting control services like APIs and databases, using an odd number for quorum
  • two network nodes hosting Neutron agents along with HAProxy / Keepalived
  • three monitoring nodes providing centralized logging, metrics collection and alerting, a feature which was critically lacking from the existing deployment

Our goal was to migrate the entire OpenStack deployment to use Kolla containers and be managed by Kolla Ansible and Kayobe, with control services running on the new control plane hardware and hypervisors reprovisioned and reconfigured, with little impact on users and their workflows.

Migration strategy

Using a small-scale candidate environment, we developed our migration strategy. The administrators of the infrastructure would install CentOS 7 on the new control plane, using their existing provisioning system, Foreman. We would configure the host OS of the new nodes with Kayobe to make them ready to deploy Kolla containers: configure multiple VLAN interfaces and networks, create LVM volumes, install Docker, etc.

We would then deploy OpenStack services on this control plane. To reduce the risk of the migration, our strategy was to progressively reconfigure the load balancers to point to the new controllers for each OpenStack service while validating that they were not causing errors. If any issue arose, we would be able to quickly revert to the API services running on the original control plane. Fresh Galera, Memcached, and RabbitMQ clusters would also be set up on the new controllers, although the existing ones would remain in use by the OpenStack services for now. We would then gradually shut down the original services after making sure that all resources are managed by the new OpenStack services.

Then, during a scheduled downtime, we would copy the content of the SQL database, reconfigure all services (on the control plane and also on hypervisors) to use the new Galera, Memcached, and RabbitMQ clusters, and move the virtual IP of the load balancer over to the new network nodes, where HAProxy and Keepalived would be deployed.

The animation below depicts the process of migrating from the original to the new control plane, with only a subset of the services displayed for clarity.

Migration from the original to the new control plane

Finally, we would use live migration to free up several hypervisors, redeploy OpenStack services on them after reprovisioning, and live migrate virtual machines back on them. The animation below shows the transition of hypervisors to Kolla:

Migration of hypervisors to Kolla

Tips & Tricks

Having described the overall migration strategy, we will now cover tasks that required special care and provide tips for operators who would like to follow the same approach.

Translating the configuration

In order to make the migration seamless, we wanted to keep the configuration of services deployed on the new control plane as close as possible to the original configuration. In some cases, this meant moving away from Kolla Ansible's sane defaults and making use of its extensive customisation capabilities. In this section, we describe how to integrate an existing configuration into Kolla Ansible.

The original configuration management tool kept entire OpenStack configuration files under source control, with unique values templated using Jinja. The existing deployment had been upgraded several times, and configuration files had not been updated with deprecation and removal of some configuration options. In comparison, Kolla Ansible uses a layered approach where configuration generated by Kolla Ansible itself is merged with additions or overrides specified by the operator either globally, per role (nova), per service (nova-api), or per host (hypervisor042). This has the advantage of reducing the amount of configuration to check at each upgrade, since Kolla Ansible will track deprecation and removals of the options it uses.

The oslo-config-validator tool from the oslo.config project helps with the task of auditing an existing configuration for outdated options. While introduced in Stein, it may be possible to run it against older releases if the API has not changed substantially. For example, to audit nova.conf using code from the stable/queens branch:

$ git clone -b stable/queens https://opendev.org/openstack/nova.git
$ cd nova
$ tox -e venv -- pip install --upgrade oslo.config # Update to the latest oslo.config release
$ tox -e venv -- oslo-config-validator --config-file etc/nova/nova-config-generator.conf --input-file /etc/nova/nova.conf

This would output messages identifying removed and deprecated options:

ERROR:root:DEFAULT/verbose not found
WARNING:root:Deprecated opt DEFAULT/notify_on_state_change found
WARNING:root:Deprecated opt DEFAULT/notification_driver found
WARNING:root:Deprecated opt DEFAULT/auth_strategy found
WARNING:root:Deprecated opt DEFAULT/scheduler_default_filters found

Once updated to match the deployed release, all the remaining options could be moved to a role configuration file used by Kolla Ansible. However, we preferred to audit each one against Kolla Ansible's templates, such as nova.conf.j2, to avoid keeping redundant options and to detect any potential conflicts. Future upgrades will be made easier by reducing the amount of custom configuration compared to Kolla Ansible's defaults.

Templating also needs to be adapted from the original configuration management system. Kolla Ansible relies on Jinja which can use variables set in Ansible. However, when called from Kayobe, extra group variables cannot be set in Kolla Ansible's inventory, so instead of cpu_allocation_ratio = {{ cpu_allocation_ratio }} you would have to use a different approach:

{% if inventory_hostname in groups['compute_big_overcommit'] %}
cpu_allocation_ratio = 16.0
{% elif inventory_hostname in groups['compute_small_overcommit'] %}
cpu_allocation_ratio = 4.0
{% else %}
cpu_allocation_ratio = 1.0
{% endif %}

Configuring Kolla Ansible to use existing services

We described earlier that our migration strategy was to progressively deploy OpenStack services on the new control plane while using the existing Galera, Memcached, and RabbitMQ clusters. This section explains how this can be configured with Kayobe and Kolla Ansible.

In Kolla Ansible, many deployment settings are configured in ansible/group_vars/all.yml, including the RabbitMQ transport URL (rpc_transport_url) and the database connection (database_address).

An operator can override these values from Kayobe using etc/kayobe/kolla/globals.yml:

rpc_transport_url: rabbit://username:password@ctrl01:5672,username:password@ctrl02:5672,username:password@ctrl03:5672
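
The database connection can be pointed at the existing Galera cluster in the same file; a minimal sketch, with an illustrative address:

database_address: 192.168.0.10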

Another approach is to populate the groups that Kolla Ansible uses to generate these variables. In Kayobe, we can create an extra group for each existing service (e.g. ctrl_rabbitmq), populate it with existing hosts, and customise the Kolla Ansible inventory to map services to them.

In etc/kayobe/kolla.yml:

kolla_overcloud_inventory_top_level_group_map:
  control:
    groups:
      - controllers
  network:
    groups:
      - network
  compute:
    groups:
      - compute
  monitoring:
    groups:
      - monitoring
  storage:
    groups:
      "{{ kolla_overcloud_inventory_storage_groups }}"
  ctrl_rabbitmq:
    groups:
      - ctrl_rabbitmq

kolla_overcloud_inventory_custom_components: "{{ lookup('template', kayobe_config_path ~ '/kolla/inventory/overcloud-components.j2') }}"

In etc/kayobe/inventory/hosts:

[ctrl_rabbitmq]
ctrl01 ansible_host=192.168.0.1
ctrl02 ansible_host=192.168.0.2
ctrl03 ansible_host=192.168.0.3

We copy overcloud-components.j2 from the Kayobe source tree to etc/kayobe/kolla/inventory/overcloud-components.j2 in our kayobe-config repository and customise it:

[rabbitmq:children]
ctrl_rabbitmq

[outward-rabbitmq:children]
ctrl_rabbitmq

While better integrated with Kolla Ansible, this approach should be used with care so that the original control plane is not reconfigured in the process. Operators can use the --limit and --kolla-limit options of Kayobe to restrict Ansible playbooks to specific groups or hosts.
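
For example, a hedged sketch of a deployment run limited to the Kolla "control" group (the group name assumes the inventory mapping shown above), leaving the original controllers in ctrl_rabbitmq untouched:

$ kayobe overcloud service deploy --kolla-limit control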

Customising Kolla images

Even though Kolla Ansible can be configured extensively, it is sometimes required to customise Kolla images. For example, we had to rebuild the heat-api container image so it would use a different Keystone domain name: Kolla uses heat_user_domain while the existing deployment used heat.

Once a modification has been pushed to the Kolla repository configured to be pulled by Kayobe, one can simply rebuild images with the kayobe overcloud container image build command.
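
A hedged sketch of rebuilding and pushing just the Heat images (the positional image name filter and the --push flag may differ between Kayobe releases):

$ kayobe overcloud container image build heat-api --push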

Deploying services on the new control plane

Before deploying services on the new control plane, it can be useful to double-check that our configuration is correct. Kayobe can generate the configuration used by Kolla Ansible with the following command:

$ kayobe overcloud service configuration generate --node-config-dir /tmp/kolla

To deploy only specific services, the operator can restrict Kolla Ansible to specific roles using tags:

$ kayobe overcloud service deploy --kolla-tags glance

Migrating resources to new services

Most OpenStack services will start managing existing resources immediately after deployment. However, a few require manual intervention from the operator to perform the transition, particularly when services are not configured for high availability.

Cinder

Even when volume data is kept on a distributed backend like a Ceph cluster, each volume can be associated with a specific cinder-volume service. The service can be identified from the os-vol-host-attr:host field in the output of openstack volume show.

$ openstack volume show <volume_uuid> -c os-vol-host-attr:host -f value
ctrl01@rbd

There is a cinder-manage command that can be used to migrate volumes from one cinder-volume service to another:

$ cinder-manage volume update_host --currenthost ctrl01@rbd --newhost newctrl01@rbd

However, there is no command to migrate only specific volumes, so if you are migrating to a larger number of cinder-volume services, some will have no volumes to manage until the Cinder scheduler allocates new volumes on them.

Do not confuse this command with cinder migrate which is designed to transfer volume data between different backends. Be advised that when the destination is a cinder-volume service using the same Ceph backend, it will happily delete your volume data!

Neutron

Unless Layer 3 High Availability is configured in Neutron, routers will be assigned to a specific neutron-l3-agent service. The existing service can be replaced with the commands:

$ openstack network agent remove router --l3 <old-agent-uuid> <router-uuid>
$ openstack network agent add router --l3 <new-agent-uuid> <router-uuid>

Similarly, you can use the openstack network agent remove network --dhcp and openstack network agent add network --dhcp commands for DHCP agents.
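
Mirroring the router example above, a sketch of the DHCP agent commands:

$ openstack network agent remove network --dhcp <old-agent-uuid> <network-uuid>
$ openstack network agent add network --dhcp <new-agent-uuid> <network-uuid>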

Live migrating instances

In addition to the new control plane, several additional compute hosts were added to the system in order to provide free resources that could host the first batch of live migrated instances. Once they were configured as Nova hypervisors, we discovered that we could not migrate instances to them because the CPU flags didn't match, even though the source hypervisors used the same hardware.

This was caused by a mismatch in BIOS versions: the existing hypervisors in production had been updated to the latest BIOS to protect against the Spectre and Meltdown vulnerabilities, but these new hypervisors had not, resulting in different CPU flags.

This is a good reminder that in a heterogeneous infrastructure, operators should check the cpu_mode used by Nova. Kashyap Chamarthy's talk on effective virtual CPU configuration in Nova gives a good overview of available options.
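
As a hedged illustration of the kind of setting to check in nova.conf on Queens (the CPU model shown is only an example and must be supported by every hypervisor involved in live migration):

[libvirt]
cpu_mode = custom
cpu_model = Haswell-noTSX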

What about downtime?

While we wanted to minimize the impact on end users and their workflows, there were no critical services running on this cloud that would have needed a zero-downtime approach. If it had been a requirement, we would have explored dynamically adding new control plane nodes to the existing clusters before removing the old ones. Instead, it was a welcome opportunity to reinitialize the configuration of several critical components to a clean slate.

The road ahead

This OpenStack deployment is now ready to benefit from all the improvements developed by the Kolla community, which released Kolla 8.0.0 and Kolla Ansible 8.0.0 for the Stein cycle earlier this summer and Kayobe 6.0.0 at the end of August. The community is now actively working on releases for OpenStack Train.

If you would like to get in touch we would love to hear from you. Reach out to us via Twitter or directly via our contact page.

by Pierre Riteau at September 17, 2019 02:00 PM

September 16, 2019

OpenStack Superuser

Must-see Containers Sessions at the Open Infrastructure Summit Shanghai

Join the open source community at the Open Infrastructure Summit Shanghai. The Summit schedule features over 100 sessions organized by use case, including container infrastructure, artificial intelligence and machine learning, high performance computing, 5G, edge computing, network functions virtualization, and public, private and multi-cloud strategies.

Here we’re highlighting some of the container infrastructure sessions you don’t want to miss. Check out the full list of sessions from this track here.

Kata Containers: a Cornerstone for Financial Grade Cloud Native Infrastructure

In 2017, the Kata Containers project was formed from the code bases contributed by Intel Clear Containers and Hyper.sh runV. A year and a half later, the OpenStack Foundation confirmed Kata Containers as a top-level Open Infrastructure project, and it has become a de facto standard for open source virtualized container technology. Meanwhile, Hyper.sh joined forces with Ant Financial to build cloud native infrastructure for financial services based on secure containers.

During this session, Ant Financial’s Xu Wang will focus on an introduction to Kata Containers and AntFin’s secure containers practice. 

Keystone as The Authentication Center for OpenStack and Kubernetes

With the rise of container services, cloud platforms that provide only virtual machine and bare metal services can no longer meet customer needs. Customers need to be able to consume all three types of service, so a unified user management and authentication system is a necessity. H3C Technologies decided to use Keystone as their user management and authentication service. Jun Gu and James Xu from H3C will cover the following topics during this session:

  1. Introduction to Keystone
  2. User management and authentication for K8s & OpenStack
  3. Keystone enhancement
  4. Integrated with third parties

Run Kubernetes on OpenStack and Bare Metal fast

Running Kubernetes on top of OpenStack provides high levels of automation and scalability. Kuryr is an OpenStack project providing a CNI plugin that uses Neutron and Octavia to supply networking for pods and services, designed primarily for Kubernetes clusters running on OpenStack machines.

Tests were performed to check how much Kuryr improves networking performance when running Kubernetes on OpenStack, compared to using the OpenShift/OVS SDN. In this session, Ramon Acedo Rodriguez from Red Hat will present the latest integrations and architecture for running Kubernetes clusters on OpenStack and bare metal. In addition, Rodriguez will discuss the performance improvements gained by using Kuryr as the SDN, showing his test results.

 

Join the global community, November 4-6 in Shanghai for these sessions and more that can help you create a strategy to solve your organization’s container infrastructure needs.

The post Must-see Containers Sessions at the Open Infrastructure Summit Shanghai appeared first on Superuser.

by Kendall Waters at September 16, 2019 02:43 PM

Nate Johnston

Calendar Merge

I work in the OpenStack community, which is a broad confederation of many teams working on projects that together compose an open source IaaS cloud. With a project of such magnitude, there are a lot of meetings, which in the OpenStack world take place on Freenode IRC. The OpenStack community has set up an automated system to schedule, manage, and log these meetings. You can see the web front end at Eavesdrop.

September 16, 2019 01:34 AM

About

Planet OpenStack is a collection of thoughts from the developers and other key players of the OpenStack projects. If you are working on OpenStack technology you should add your OpenStack blog.

Last updated:
November 14, 2019 08:07 AM
All times are UTC.

Powered by:
Planet