September 19, 2019

OpenStack Superuser

Unleashing the Open Infrastructure Potentials at OpenInfra Days Vietnam 2019

Hosted in Hanoi and organized by the Vietnam OpenInfra User Group (VOI), Vietnam Internet Association (VIA), and VFOSSA, the second Vietnam OpenInfra Days exceeded expectations, selling out in two weeks and attracting sponsorship offers until one week before the event. Broadening its focus to open infrastructure, the event attracted 300 people to the morning sessions and 500 people to the afternoon (open) sessions. Attendees represented more than 90 companies, including telcos, cloud, and mobile application providers, who have been applying open source technologies to run their cloud infrastructure and are seeking to unleash its potential to increase flexibility, efficiency, and ease of management.

VOID 2019 morning session and exhibition booths

Structured around container technologies, automation, and security, the agenda featured 25 sessions, including case studies, demos, and tutorials. In their talks, the speakers (solution architects, software architects, and DevOps engineers) shared their experiences and best practices from building and running their customers' infrastructure, as well as their own, using OpenStack, Kubernetes, CI/CD, and other tools. Many lively discussions carried over into the breaks and the gala dinner, showing immense interest in open infrastructure.

“The event, in general, is a playground for open source developers, particularly in open infrastructure. In addition, through the event we would like to bring real case studies which are happening in the world to Vietnam so that companies in Vietnam who have been applying open source can see the general trend of the world, as well as make them more confident in open source-based product orientation,” said Tuan Huu Luong, one of the founders of OpenInfra User Group Vietnam, in an interview with VTC1, a national broadcaster.

Local news coverage of the Vietnam OpenInfra Days 2019

Though officially an OSF User Group meetup, the Vietnam OpenInfra Day (VOID) is the largest event on cloud computing and ICT infrastructure in Vietnam. This year's second edition also showed the impact of the Vietnam OpenInfra community in the region, with sponsors from Korea, Japan, Singapore, and Taiwan, and half of the speakers coming from abroad. Accordingly, the organizing team put together a rich program for speakers, sponsors, and attendees.

Speakers and sponsors received a warm welcome at the pre-event party in a local brewery, where discussions and opinions on open infrastructure and its trends were exchanged. A five-star lunch buffet at the event venue, the InterContinental Hanoi, provided a pleasant occasion for attendees to meet and network. Finally, the gala dinner in an authentic Vietnamese restaurant offered a chance to wrap up the OpenInfra discussions and introduce international guests to Vietnamese food culture, and of course, the noisy drinking culture: “Uong Bia Di, 1-2-3 Zooo!” The photo gallery of the event can be found here.

The Vietnam OpenInfra team was impressed by and thankful for the large turnout from the OpenInfra Korea User Group, even though the plan to co-organize the event fell through due to lack of time. A plan to co-organize a Korea OpenInfra Meetup was worked out during the event, and the Korean attendees clearly enjoyed themselves.

Korea OpenInfra User Group at the VOID 2019

Last but not least, the success of the event owes much to the constant support of the OpenStack Foundation (OSF), which was a silver sponsor this year, and especially to the participation of OSF members in organizing the OpenStack Upstream Institute training in Hanoi following the main event. Ildiko Vancsa, Kendall Nelson, and volunteer trainers from the Vietnam and Korea User Groups delivered a surprisingly fun and productive training day to a new generation of contributors from Vietnam.

OpenStack Upstream Institute Training Hanoi

Time to say goodbye to VOID 2019. See you again at the next VOID; until then, we will celebrate the open infrastructure community’s achievements with a series of events, starting with the Korea Meetup in October (TBD)!

VOID 2019 (left) and OpenStack Upstream Institute (right) organizing teams

The post Unleashing the Open Infrastructure Potentials at OpenInfra Days Vietnam 2019 appeared first on Superuser.

by Trinh Nguyen at September 19, 2019 01:00 AM

September 18, 2019

Adam Spiers

Improving trust in the cloud with OpenStack and AMD SEV

This post contains an exciting announcement, but first I need to provide some context!

Ever heard that joke “the cloud is just someone else’s computer”?

Coffee mug saying "There is no cloud. It's just someone else's computer"

Of course it’s a gross over-simplification, but there’s more than a grain of truth in it. And that raises the question: if your applications are running in someone else’s data-centre, how can you trust that they’re not being snooped upon, or worse, invasively tampered with?

Until recently, the answer was “you can’t”. Well, that’s another over-simplification. You could design your workload to be tamperproof; for example even if individual mining nodes in Bitcoin or Ethereum are compromised, the blockchain as a whole will resist the attack just fine. But there’s still the snooping problem.

Hardware to the rescue?

However, there’s some good news on this front. Intel and AMD realised this was a problem, and have both introduced new hardware capabilities to help improve the level to which cloud users can trust the environment in which their workloads are executed, e.g.:

  • AMD SEV (Secure Encrypted Virtualization) which can encrypt the memory of a running VM with a key which is only accessible to the owner of that VM. This is done on-chip so that even if you have physical access to the machine, it makes it a lot harder to snoop in on the running VM1.

    It can also provide the guest owner with an attestation which cryptographically proves that the memory was encrypted correctly and can only be decrypted by the owner.

  • Intel MKTME (Multi-Key Total Memory Encryption) which is a similar approach.

But even with that hardware support, there is the question of to what degree anyone can trust public clouds run on proprietary technology. There is a growing awareness that Free (Libre) / Open Source Software tends to be inherently more secure and trustworthy, since its transparency enables unlimited peer review, and its openness allows anyone to contribute improvements.

And these days, OpenStack is pretty much the undisputed king of the Open Source cloud infrastructure world.

An exciting announcement

So I’m delighted to be able to announce a significant step forward in trustworthy cloud computing: as of this week, OpenStack is now able to launch VMs with SEV enabled! (Given the appropriate AMD hardware, of course.)

The new hw:mem_encryption flavor extra spec

The core functionality is all merged and will be in the imminent Train release. You can read the documentation, and you will also find it mentioned in the Nova Release Notes.

While this is “only” an MVP and far from the end of the journey (see below), it’s an important milestone in a strong partnership between my employer SUSE and AMD. We started work on adding SEV support into OpenStack around a year ago:

The original blueprint for integrating AMD SEV into nova

This resulted in one of the most in-depth technical specification documents I’ve ever had to write, plus many months of intense collaboration on the code and several changes in design along the way.

SEV code reviews. Click to view in Gerrit!

I’d like to thank not only my colleagues at SUSE and AMD for all their work so far, but also many members of the upstream OpenStack community, especially the Nova team. In particular I enjoyed fantastic support from the PTL (Project Technical Lead) Eric Fried, and several developers at Red Hat, which I think speaks volumes to how well the “coopetition” model works in the Open Source world.

The rest of this post gives a quick tour of the implementation via screenshots and brief explanations, and then concludes with what’s planned next.

OpenStack’s Compute service (nova) will automatically detect the presence of the SEV feature on any compute node which is configured to support it. You can optionally configure how many slots are available on the memory controller for encryption keys. One is used for each guest, so this effectively acts as the maximum number of guest VMs which can concurrently use SEV. Here you can see the configuration of this option, and how nova handles the inventory. Note that it also registers an SEV trait on the compute host, so that in the future if the cloud has a mix of hardware offering different guest memory encryption technologies, you’ll be able to choose which one you want for any given guest, if you need to.

Inventorying the SEV feature.
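As a rough illustration (not a copy of the deployment shown above), the corresponding fragment of nova.conf on an SEV-capable compute node might look like the following; the option name is given from memory and should be verified against the Nova libvirt driver documentation for your release:

[libvirt]
# Assumed option name; limits how many guests can hold a memory encryption key slot
num_memory_encrypted_guests = 15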

SEV can be enabled by the operator by adding a new hw:mem_encryption “extra spec” which is a property on nova’s flavors. As already shown in the screenshot above, this can be done through Horizon, OpenStack’s web dashboard. However it can also be set per-image via a similarly-named property hw_mem_encryption:

Enabling SEV via image property in Horizon.

and of course this can all be done via the command-line too:

Enabling SEV via CLI. Click for full size.
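For instance, a hypothetical flavor and image could be prepared roughly as follows (the flavor and image names are placeholders; the extra spec and image properties are those described in the documentation):

$ openstack flavor set m1.sev --property hw:mem_encryption=true
$ openstack image set my-sev-image \
    --property hw_mem_encryption=true \
    --property hw_firmware_type=uefi \
    --property hw_machine_type=q35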

Notice the presence of a few other image properties which are crucial for SEV to function correctly. (These are explained fully in the documentation.)

Once booted, an SEV VM instance looks and behaves pretty much like any other OpenStack VM:

SEV instances listed in Horizon

However there are some limitations, e.g. it cannot yet be live-migrated or suspended:

Enabling SEV via flavor extra spec or image property

Behind the scenes, nova takes care of quite a few important details in how the VM is configured in libvirt. Firstly it performs sanity checks on the flavor and image properties. Then it adds a crucial new <launchSecurity> element:

Enabling SEV via flavor extra spec or image property
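For reference, a libvirt domain definition with SEV enabled contains an element roughly of this shape (the values here are purely illustrative and depend on the host firmware and CPU):

<launchSecurity type='sev'>
  <cbitpos>47</cbitpos>
  <reducedPhysBits>1</reducedPhysBits>
  <policy>0x0003</policy>
</launchSecurity>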

and also enables IOMMU for virtio devices:

Enabling IOMMU for virtio devices
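Again for reference, this amounts to setting the iommu attribute on the driver element of each virtio device; an illustrative snippet for the memory balloon device:

<memballoon model='virtio'>
  <driver iommu='on'/>
</memballoon>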

What’s next?

This area of technology is new and rapidly evolving, so there is still plenty of work left to be done, especially on the software side.

Of course we’ll be adding this functionality to SUSE OpenStack Cloud, initially as a technical preview for our customers to try out.

Probably the most important feature needed next on the SEV side is the ability to verify the attestation which cryptographically proves that the memory was encrypted correctly and can only be decrypted by the owner. In addition, work on the specification required to add support for Intel’s MKTME to OpenStack has already started, so I would expect that to continue.

Footnotes:

1. There are still potential attacks, e.g. snooping unencrypted memory caches or CPU registers. Work by AMD and others is ongoing to address these.


The post Improving trust in the cloud with OpenStack and AMD SEV appeared first on Structured Procrastination.

by Adam at September 18, 2019 11:32 AM

September 17, 2019

StackHPC Team Blog

Migrating a running OpenStack to containerisation with Kolla

Deploying OpenStack infrastructures with containers brings many operational benefits, such as isolation of dependencies and repeatability of deployment, in particular when coupled with a CI/CD approach. The Kolla project provides tooling that helps deploy and operate containerised OpenStack deployments. Configuring a new OpenStack cloud with Kolla containers is well documented and can benefit from the sane defaults provided by the highly opinionated Kolla Ansible subproject. However, migrating existing OpenStack deployments to Kolla containers can require a more ad hoc approach, particularly to minimise impact on end users.

We recently helped an organization migrate an existing OpenStack Queens production deployment to a containerised solution using Kolla and Kayobe, a subproject designed to simplify the provisioning and configuration of bare-metal nodes. This blog post describes the migration strategy we adopted in order to reduce impact on end users and shares what we learned in the process.

Existing OpenStack deployment

The existing cloud was running the OpenStack Queens release deployed using CentOS RPM packages. This cloud was managed by a control plane of 16 nodes, with each service deployed over two (for OpenStack services) or three (for Galera and RabbitMQ) servers for high availability. Around 40 hypervisor nodes from different generations of hardware were available, resulting in a heterogeneous mix of CPU models, amount of RAM, and even network interface names (with some nodes using onboard Ethernet interfaces and others using PCI cards).

A separate Ceph cluster was used as a backend for all OpenStack services requiring large amounts of storage: Glance, Cinder, Gnocchi, and also disks of Nova instances (i.e. none of the user data was stored on hypervisors).

A new infrastructure

With a purchase of new control plane hardware also being planned, we advised the following configuration, based on our experience and recommendations from Kolla Ansible:

  • three controller nodes hosting control services like APIs and databases, using an odd number for quorum
  • two network nodes hosting Neutron agents along with HAProxy / Keepalived
  • three monitoring nodes providing centralized logging, metrics collection and alerting, a feature which was critically lacking from the existing deployment

Our goal was to migrate the entire OpenStack deployment to use Kolla containers and be managed by Kolla Ansible and Kayobe, with control services running on the new control plane hardware and hypervisors reprovisioned and reconfigured, with little impact on users and their workflows.

Migration strategy

Using a small-scale candidate environment, we developed our migration strategy. The administrators of the infrastructure would install CentOS 7 on the new control plane, using their existing provisioning system, Foreman. We would configure the host OS of the new nodes with Kayobe to make them ready to deploy Kolla containers: configure multiple VLAN interfaces and networks, create LVM volumes, install Docker, etc.

We would then deploy OpenStack services on this control plane. To reduce the risk of the migration, our strategy was to progressively reconfigure the load balancers to point to the new controllers for each OpenStack service while validating that they were not causing errors. If any issue arose, we would be able to quickly revert to the API services running on the original control plane. Fresh Galera, Memcached, and RabbitMQ clusters would also be set up on the new controllers, although the existing ones would remain in use by the OpenStack services for now. We would then gradually shut down the original services after making sure that all resources are managed by the new OpenStack services.

Then, during a scheduled downtime, we would copy the content of the SQL database, reconfigure all services (on the control plane and also on hypervisors) to use the new Galera, Memcached, and RabbitMQ clusters, and move the virtual IP of the load balancer over to the new network nodes, where HAProxy and Keepalived would be deployed.

The animation below depicts the process of migrating from the original to the new control plane, with only a subset of the services displayed for clarity.

Migration from the original to the new control plane

Finally, we would use live migration to free up several hypervisors, redeploy OpenStack services on them after reprovisioning, and live migrate virtual machines back on them. The animation below shows the transition of hypervisors to Kolla:

Migration of hypervisors to Kolla

Tips & Tricks

Having described the overall migration strategy, we will now cover tasks that required special care and provide tips for operators who would like to follow the same approach.

Translating the configuration

In order to make the migration seamless, we wanted to keep the configuration of services deployed on the new control plane as close as possible to the original configuration. In some cases, this meant moving away from Kolla Ansible's sane defaults and making use of its extensive customisation capabilities. In this section, we describe how to integrate an existing configuration into Kolla Ansible.

The original configuration management tool kept entire OpenStack configuration files under source control, with unique values templated using Jinja. The existing deployment had been upgraded several times, and configuration files had not been updated with deprecation and removal of some configuration options. In comparison, Kolla Ansible uses a layered approach where configuration generated by Kolla Ansible itself is merged with additions or overrides specified by the operator either globally, per role (nova), per service (nova-api), or per host (hypervisor042). This has the advantage of reducing the amount of configuration to check at each upgrade, since Kolla Ansible will track deprecation and removals of the options it uses.

The oslo-config-validator tool from the oslo.config project helps with the task of auditing an existing configuration for outdated options. While introduced in Stein, it may be possible to run it against older releases if the API has not changed substantially. For example, to audit nova.conf using code from the stable/queens branch:

$ git clone -b stable/queens https://opendev.org/openstack/nova.git
$ cd nova
$ tox -e venv -- pip install --upgrade oslo.config # Update to the latest oslo.config release
$ tox -e venv -- oslo-config-validator --config-file etc/nova/nova-config-generator.conf --input-file /etc/nova/nova.conf

This would output messages identifying removed and deprecated options:

ERROR:root:DEFAULT/verbose not found
WARNING:root:Deprecated opt DEFAULT/notify_on_state_change found
WARNING:root:Deprecated opt DEFAULT/notification_driver found
WARNING:root:Deprecated opt DEFAULT/auth_strategy found
WARNING:root:Deprecated opt DEFAULT/scheduler_default_filters found

Once updated to match the deployed release, all the remaining options could be moved to a role configuration file used by Kolla Ansible. However, we preferred to audit each one against Kolla Ansible templates, such as nova.conf.j2, to avoid keeping redundant options and detect any potential conflicts. Future upgrades will be made easier by reducing the amount of custom configuration compared to Kolla Ansible's defaults.

Templating also needs to be adapted from the original configuration management system. Kolla Ansible relies on Jinja which can use variables set in Ansible. However, when called from Kayobe, extra group variables cannot be set in Kolla Ansible's inventory, so instead of cpu_allocation_ratio = {{ cpu_allocation_ratio }} you would have to use a different approach:

{% if inventory_hostname in groups['compute_big_overcommit'] %}
cpu_allocation_ratio = 16.0
{% elif inventory_hostname in groups['compute_small_overcommit'] %}
cpu_allocation_ratio = 4.0
{% else %}
cpu_allocation_ratio = 1.0
{% endif %}

Configuring Kolla Ansible to use existing services

We described earlier that our migration strategy was to progressively deploy OpenStack services on the new control plane while using the existing Galera, Memcached, and RabbitMQ clusters. This section explains how this can be configured with Kayobe and Kolla Ansible.

In Kolla Ansible, many deployment settings are configured in ansible/group_vars/all.yml, including the RabbitMQ transport URL (rpc_transport_url) and the database connection (database_address).

An operator can override these values from Kayobe using etc/kayobe/kolla/globals.yml:

rpc_transport_url: rabbit://username:password@ctrl01:5672,username:password@ctrl02:5672,username:password@ctrl03:5672
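The database endpoint can be overridden in the same file; for example, pointing at the existing Galera cluster's virtual IP (the hostname below is illustrative):

database_address: existing-galera-vip.example.com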

Another approach is to populate the groups that Kolla Ansible uses to generate these variables. In Kayobe, we can create an extra group for each existing service (e.g. ctrl_rabbitmq), populate it with existing hosts, and customise the Kolla Ansible inventory to map services to them.

In etc/kayobe/kolla.yml:

kolla_overcloud_inventory_top_level_group_map:
  control:
    groups:
      - controllers
  network:
    groups:
      - network
  compute:
    groups:
      - compute
  monitoring:
    groups:
      - monitoring
  storage:
    groups:
      "{{ kolla_overcloud_inventory_storage_groups }}"
  ctrl_rabbitmq:
    groups:
      - ctrl_rabbitmq

kolla_overcloud_inventory_custom_components: "{{ lookup('template', kayobe_config_path ~ '/kolla/inventory/overcloud-components.j2') }}"

In etc/kayobe/inventory/hosts:

[ctrl_rabbitmq]
ctrl01 ansible_host=192.168.0.1
ctrl02 ansible_host=192.168.0.2
ctrl03 ansible_host=192.168.0.3

We copy overcloud-components.j2 from the Kayobe source tree to etc/kayobe/kolla/inventory/overcloud-components.j2 in our kayobe-config repository and customise it:

[rabbitmq:children]
ctrl_rabbitmq

[outward-rabbitmq:children]
ctrl_rabbitmq

While better integrated with Kolla Ansible, this approach should be used with care so that the original control plane is not reconfigured in the process. Operators can use the --limit and --kolla-limit options of Kayobe to restrict Ansible playbooks to specific groups or hosts.
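As a hypothetical example, the following would deploy only the RabbitMQ role and only touch hosts in the Kolla control group (tag and group names depend on your inventory):

$ kayobe overcloud service deploy --kolla-tags rabbitmq --kolla-limit control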

Customising Kolla images

Even though Kolla Ansible can be configured extensively, it is sometimes required to customise Kolla images. For example, we had to rebuild the heat-api container image so it would use a different Keystone domain name: Kolla uses heat_user_domain while the existing deployment used heat.

Once a modification has been pushed to the Kolla repository configured to be pulled by Kayobe, one can simply rebuild images with the kayobe overcloud container image build command.

Deploying services on the new control plane

Before deploying services on the new control plane, it can be useful to double-check that our configuration is correct. Kayobe can generate the configuration used by Kolla Ansible with the following command:

$ kayobe overcloud service configuration generate --node-config-dir /tmp/kolla

To deploy only specific services, the operator can restrict Kolla Ansible to specific roles using tags:

$ kayobe overcloud service deploy --kolla-tags glance

Migrating resources to new services

Most OpenStack services will start managing existing resources immediately after deployment. However, a few require manual intervention from the operator to perform the transition, particularly when services are not configured for high availability.

Cinder

Even when volume data is kept on a distributed backend like a Ceph cluster, each volume can be associated with a specific cinder-volume service. The service can be identified from the os-vol-host-attr:host field in the output of openstack volume show.

$ openstack volume show <volume_uuid> -c os-vol-host-attr:host -f value
ctrl01@rbd

There is a cinder-manage command that can be used to migrate volumes from one cinder-volume service to another:

$ cinder-manage volume update_host --currenthost ctrl01@rbd --newhost newctrl01@rbd

However, there is no command to migrate only specific volumes, so if you are migrating to a larger number of cinder-volume services, some will have no volumes to manage until the Cinder scheduler allocates new volumes on them.

Do not confuse this command with cinder migrate which is designed to transfer volume data between different backends. Be advised that when the destination is a cinder-volume service using the same Ceph backend, it will happily delete your volume data!

Neutron

Unless Layer 3 High Availability is configured in Neutron, routers will be assigned to a specific neutron-l3-agent service. The existing service can be replaced with the commands:

$ openstack network agent remove router --l3 <old-agent-uuid> <router-uuid>
$ openstack network agent add router --l3 <new-agent-uuid> <router-uuid>

Similarly, you can use the openstack network agent remove network --dhcp and openstack network agent add network --dhcp commands for DHCP agents.
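For example (UUIDs are placeholders):

$ openstack network agent remove network --dhcp <old-agent-uuid> <network-uuid>
$ openstack network agent add network --dhcp <new-agent-uuid> <network-uuid>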

Live migrating instances

In addition to the new control plane, several additional compute hosts were added to the system, in order to provide free resources that could host the first batch of live migrated instances. Once configured as Nova hypervisors, we discovered that we could not migrate instances to them because CPU flags didn't match, even though source hypervisors were using the same hardware.

This was caused by a mismatch in BIOS versions: the existing hypervisors in production had been updated to the latest BIOS to protect against the Spectre and Meltdown vulnerabilities, but these new hypervisors had not, resulting in different CPU flags.

This is a good reminder that in a heterogeneous infrastructure, operators should check the cpu_mode used by Nova. Kashyap Chamarthy's talk on effective virtual CPU configuration in Nova gives a good overview of available options.
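As a sketch, pinning all hypervisors to a lowest-common-denominator CPU model in nova.conf avoids this class of problem; the model name below is only an example and must match the oldest hardware in the fleet:

[libvirt]
# Use an explicit model shared by every hypervisor instead of host-model/host-passthrough
cpu_mode = custom
cpu_model = Haswell-noTSX-IBRS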

What about downtime?

While we wanted to minimize the impact on end users and their workflow, there were no critical services running on this cloud that would have needed a zero-downtime approach. If it had been a requirement, we would have explored dynamically adding new control plane nodes to the existing clusters before removing the old ones. Instead, it was a welcome opportunity to reinitialize the configuration of several critical components to a clean slate.

The road ahead

This OpenStack deployment is now ready to benefit from all the improvements developed by the Kolla community, which released Kolla 8.0.0 and Kolla Ansible 8.0.0 for the Stein cycle earlier this summer and Kayobe 6.0.0 at the end of August. The community is now actively working on releases for OpenStack Train.

If you would like to get in touch we would love to hear from you. Reach out to us via Twitter or directly via our contact page.

by Pierre Riteau at September 17, 2019 01:37 PM

September 16, 2019

OpenStack Superuser

Must-see Containers Sessions at the Open Infrastructure Summit Shanghai

Join the open source community at the Open Infrastructure Summit Shanghai. The Summit schedule features over 100 sessions organized by use cases including: container infrastructure, artificial intelligence and machine learning, high performance computing, 5G, edge computing, network functions virtualization, and public, private and multi-cloud strategies.

Here we’re highlighting some of the container infrastructure sessions you don’t want to miss. Check out the full list of sessions from this track here.

Kata Containers: a Cornerstone for Financial Grade Cloud Native Infrastructure

In 2017, the Kata Containers project was formed out of the code bases contributed by Intel Clear Containers and Hyper.SH runV. A year and a half later, the OpenStack Foundation confirmed Kata Containers as a top-level OpenInfra project, and it has become a de facto standard for open source virtualized container technology. Meanwhile, Hyper.sh joined forces with Ant Financial to build cloud native infrastructure for financial services based on secure containers.

During this session, Ant Financial’s Xu Wang will focus on an introduction to Kata Containers and AntFin’s secure containers practice. 

Keystone as The Authentication Center for OpenStack and Kubernetes

With the rise of container services, cloud platforms that provide only virtual machine and bare metal services can no longer meet customer needs. Customers need to be able to consume all three types of service, so a unified user management and authentication system is a necessity. H3C Technologies decided to use Keystone as their user management and authentication service. Jun Gu and James Xu from H3C will cover the following topics during this session:

  1. Introduction to Keystone
  2. User management and authentication for K8s & OpenStack
  3. Keystone enhancement
  4. Integrated with third parties

Run Kubernetes on OpenStack and Bare Metal fast

Running Kubernetes on top of OpenStack provides high levels of automation and scalability. Kuryr is an OpenStack project providing a CNI plugin that uses Neutron and Octavia to deliver networking for pods and services, primarily designed for Kubernetes clusters running on OpenStack machines.

Tests were performed to check how much Kuryr improves networking performance when running Kubernetes on OpenStack compared to using the OpenShift/OVS SDN. In this session, Ramon Acedo Rodriquez from Red Hat will present the latest integrations and architecture for running Kubernetes clusters on OpenStack and bare metal. In addition, Rodriquez will discuss the performance improvements gained by using Kuryr as the SDN, backed by his test results.


Join the global community, November 4-6 in Shanghai for these sessions and more that can help you create a strategy to solve your organization’s container infrastructure needs.

The post Must-see Containers Sessions at the Open Infrastructure Summit Shanghai appeared first on Superuser.

by Kendall Waters at September 16, 2019 02:43 PM

Nate Johnston

Calendar Merge

I work in the OpenStack community, which is a broad confederation of many teams working on projects that together compose an open source IaaS cloud. With a project of such magnitude, there are a lot of meetings, which in the OpenStack world take place on Freenode IRC. The OpenStack community has set up an automated system to schedule, manage, and log these meetings. You can see the web front end at Eavesdrop.

September 16, 2019 01:34 AM

September 13, 2019

StackHPC Team Blog

Fabric control in Intel MPI

High Performance Computing usually involves some sort of parallel computing and process-level parallelisation using the MPI (Message Passing Interface) protocol has been a common approach on "traditional" HPC clusters. Although alternative approaches are gaining some ground, getting good MPI performance will continue to be crucially important for many big scientific workloads even in a cloudy new world of software-defined infrastructure.

There are several high-quality MPI implementations available and deciding which one to use is important as applications must be compiled against specific MPI libraries - the different MPI libraries are (broadly) source-compatible but not binary-compatible. Unfortunately selecting the "right" one to use is not straightforward as a search for benchmarks will quickly show, with different implementations coming out on top in different situations. Intel's MPI has historically been a strong contender, with easy "yum install" deployment, good performance (especially on Intel processors), and being - unlike Intel's compilers - free to use. Intel MPI 2018 still remains relevant even for new installs as the 2019 versions have had various issues, including the fairly-essential hydra manager appearing not to work with at least some AMD processors. A fix for this is apparently planned for 2019 update 5 but there is no release date for this yet.

MPI can run over many different types of interconnect or "fabrics" that are actually carrying the inter-process communications, such as Ethernet, InfiniBand etc. and the Intel MPI runtime will, by default, automatically try to select a fabric which works. Knowing how to control fabric choices is however still important as there is no guarantee it will select the optimal fabric, and fall-back through non-working options can lead to slow startup or lots of worrying error messages for the user.

Intel significantly changed the fabric control between 2018 and 2019 MPI versions but this isn't immediately obvious from the changelog and you have to jump about between the developer references and developer guides to get the full picture. In both MPI versions the I_MPI_FABRICS environment variable specifies the fabric, but the values it takes are quite different:

  • For 2018 options are shm, dapl, tcp, tmi, ofa or ofi, or you can use x:y to control intra- and inter-node communications separately (see the docs for which combinations are valid).
  • For 2019 options are only ofi, shm:ofi or shm, with the 2nd option setting intra- and inter-node communications separately as before.

The most generally-useful options are probably:

  • shm (2018 & 2019): The shared memory transport; only applicable to intra-node communication so generally used with another transport as suggested above - see the docs for details.
  • tcp (2018 only): A TCP/IP capable fabric e.g. Ethernet or IB via IPoIB.
  • ofi (2018 & 2019): An "OpenFabrics Interfaces-capable fabric". These use a library called libfabric (either an Intel-supplied or "external" version) which provides a fixed application-facing API while talking to one of several "OFI providers" which communicate with the interconnect hardware. Really your choice of provider here depends on the hardware, with possibilities being:
    • psm2: Intel OmniPath
    • verbs: InfiniBand or iWARP
    • RxM: A utility provider supporting verbs
    • sockets: Again a TCP/IP capable fabric but this time through libfabric. It's not intended to be faster than the 2018 tcp option, but allows developing/debugging libfabric code without actually having a faster interconnect available.

With both 2018 and 2019 you can use I_MPI_OFI_PROVIDER_DUMP=enable to see which providers MPI thinks are available.
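As a quick sketch, a 2019 MPI job on an InfiniBand cluster might select its fabric and provider explicitly along these lines (the provider choice and process count are illustrative, not a recommendation):

export I_MPI_FABRICS=shm:ofi
export I_MPI_OFI_PROVIDER=verbs
export I_MPI_OFI_PROVIDER_DUMP=enable   # print the providers libfabric can see
mpirun -np 64 ./my_mpi_app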

2018 also supported some additional options which have gone away in 2019:

  • ofa (2018): "OpenFabrics Alliance" e.g. InfiniBand (through OFED Verbs) & possibly also iWARP and RoCE?
  • dapl (2018): "Direct Access Programming Library" e.g. InfiniBand and iWARP.
  • tmi (2018): "Tag Matching Interface" e.g. Intel True Scale Fabric, Intel Omni-Path Architecture, Myrinet

With any of these fabrics there are additional variables to tweak things. 2018 has I_MPI_FABRICS_LIST which allows specification of a list of available fabrics to try, plus variables to control fallback through this list. These variables are all gone in 2019, now that there are fewer fabric options. Intel have clearly decided to concentrate on OFI/libfabric, which unifies (or restricts, depending on your view!) the application-facing interface.

If you're using the 2018 MPI over InfiniBand you might be wondering which option to use; at least back in 2012 performance between DAPL and OFA/OFED Verbs was apparently generally similar although the transport options available varied, so which is usable/best if both are available will depend on your application and hardware.

HPC Fabrics in the Public Cloud

Hybrid and public cloud HPC solutions have been gaining increasing attention, with scientific users looking to burst peak usage out to the cloud, or investigating the impact of wholesale migration.

Azure have been pushing their capabilities for HPC hard recently, showcasing ongoing work to get closer to bare-metal performance and launching a 2nd generation of "HB-series" VMs which provide 120 cores of AMD Epyc 7002 processors. With InfiniBand interconnects and as many as 80,000 cores of HBv2 available for jobs for (some) customers, Azure looks to be providing pay-as-you-go access to some very serious (virtual) hardware. And in addition to providing a platform for new HPC workloads in the cloud, for organisations which are already embedded in the Microsoft ecosystem Azure may seem an obvious route to acquiring a burst capacity for on-premises HPC workloads.

If you're running in a virtualised environment such as Azure, MPI configuration is likely to have additional complexities and a careful read of any and all documentation you can get your hands on is likely to be needed.

For example for Azure, the recommended Intel MPI settings described here, here and in the suite of pages here vary depending on which type of VM you are using:

  • Standard and most compute-optimised nodes only have Ethernet (needing tcp or sockets) which is likely to make them uninteresting for multi-node MPI jobs.
  • Hr-series VMs and some others have FDR InfiniBand but need specific drivers (provided in an Azure image), Intel MPI 2016 and the DAPL provider set to ofa-v2-ib0.
  • HC44 and HB60 VMs have EDR InfiniBand and can theoretically use any MPI (although for HB60 VMs note the issues with Intel 2019 MPI on AMD processors mentioned above) but need the appropriate fabric to be manually set.

InfiniBand on Azure still seems to be undergoing considerable development with for example new drivers for MVAPICH2 coming out around now so treat any guidance with a pinch of salt until you know it's not stale, to mix metaphors!

---

If you would like to get in touch we would love to hear from you. Reach out to us on Twitter or directly via our contact page.

by Steve Brasier at September 13, 2019 03:30 PM

Chris Dent

Placement Update 19-36

Here's placement update 19-36. There won't be one next week, as I will be away. Because of my forthcoming "less time available for OpenStack" I will also be stopping these updates at some point in the next month or so, so I can focus the limited time I will have on reviewing and coding. There will be at least one more.

Most Important

The big news this week is that after returning from a trip (that meant he was away during the nomination period) Tetsuro has stepped up to be the PTL for placement in Ussuri. Thanks very much to him for taking this up, I'm sure he will be excellent.

We need to work on useful documentation for the features developed this cycle.

I've also made a new worklist in StoryBoard to draw attention to placement project stories that are relevant to the next few weeks, making it easier to ignore those that are not relevant now, but may be later.

Stories/Bugs

(Numbers in () are the change since the last pupdate.)

There are 23 (-1) stories in the placement group. 0 (0) are untagged. 5 (0) are bugs. 4 (0) are cleanups. 10 (-1) are rfes. 5 (1) are docs.

If you're interested in helping out with placement, those stories are good places to look.

osc-placement

  • https://review.opendev.org/666542 Add support for multiple member_of. There's been some useful discussion about how to achieve this, and a consensus has emerged on how to get the best results.

Main Themes

Consumer Types

Adding a type to consumers will allow them to be grouped for various purposes, including quota accounting.

  • https://review.opendev.org/#/q/topic:bp/support-consumer-types This has some good comments on it from melwitt. I'm going to be away next week, so if someone else would like to address them that would be great. If it is deemed fit to merge, we should, despite feature freeze passing, since we haven't had much churn lately. If it doesn't make it in Train, that's fine too. The goal is to have it ready for Nova in Ussuri as early as possible.

Cleanup

Cleanup is an overarching theme related to improving documentation, performance and the maintainability of the code. The changes we are making this cycle are fairly complex to use and are fairly complex to write, so it is good that we're going to have plenty of time to clean and clarify all these things.

Performance related explorations continue:

One outcome of the performance work needs to be something like a Deployment Considerations document to help people choose how to tweak their placement deployment to match their needs. The simple answer is use more web servers and more database servers, but that's often very wasteful.

Other Placement

Miscellaneous changes can be found in the usual place.

There are three os-traits changes being discussed. And two os-resource-classes changes. The latter are docs-related.

Other Service Users

New reviews are added to the end of the list. Reviews that haven't had attention in a long time (boo!) or that have merged or been approved (yay!) are removed.

End

🐈

by Chris Dent at September 13, 2019 11:18 AM

September 10, 2019

Aptira

10th Birthday + 10% off!

Aptira 10 year birthday 10% off sale

It’s our 10th birthday – and you get the presents!

Did you know that Aptira was founded at 9 minutes past 9, on the 9th day of the 9th month, in 2009? 2009 was also the year that NASA launched the final space shuttle mission to the Hubble Telescope. Great things happened on this day, with the founding of Aptira being no exception.

Yesterday we turned 10! We wouldn’t be here if it wasn’t for our amazing customers. So to celebrate, we are offering 10% off all our services from the 10th of September until the 10th of October. That’s 10% off managed services, 10% off training, 10% off everything except hardware. This 10% discount also applies to pre-paid services, so you can pre-pay for the next 12 months to really maximise your savings!

And for the extra icing on the cake (even though it doesn’t have a 10 in it), we’ll give you a free 2 hour consulting session to help get you started with transforming your Cloud solution. Chat with a Solutionaut today to take advantage of this once in a decade discount.

The post 10th Birthday + 10% off! appeared first on Aptira.

by Aptira at September 10, 2019 01:00 PM

September 09, 2019

OpenStack Superuser

Must-see 5G and edge computing sessions at the Open Infrastructure Summit Shanghai

Creating an edge computing strategy? Looking for reference architectures or a vendor to support your strategy?

Join the people building and operating open infrastructure at the Open Infrastructure Summit Shanghai, November 4-6 where you will come with questions, and leave with an edge computing strategy. The Summit schedule features over 100 sessions covering over 30 open source projects organized by use cases including: artificial intelligence and machine learning, high performance computing, 5G, edge computing, network functions virtualization (NFV), container infrastructure and public, private and multi-cloud strategies.

Here we’re highlighting some of the edge computing sessions you’ll want to add to your schedule. Check out the entire track here.

Towards Guaranteed Low Latency And High Security Service Based On Mobile Edge Computing (MEC) Technology

As a 5G pioneer, SK Telecom (SKT) began developing its own MEC platform last year. It was designed and developed to respond to a variety of business requirements and to comply with 3GPP/ETSI standards. To interwork with current 4G/5G technologies, SKT implemented unique edge routing technology and commercialized the MEC platform last year. The platform is linked to their 5G network and is currently providing a smart factory pilot service which requires extremely low latency. This talk will provide an overview of SKT’s MEC architecture, lessons learned from commercialization, and their future plans.

Secured Edge Infrastructure For Contactless Payment System

China UnionPay will discuss their StarlingX architecture and how they apply security hardening features on the underlying OpenStack and Kubernetes platform. They will describe an architecture that supports both virtual machine and container resources for edge payment service applications, including face recognition, car license plate detection, payment, and more. Learn more about smart payment requirements and reference implementations for that use case, including capabilities like resource management, security isolation, and more.

Network Function Virtualization Orchestration By Airship

This session will cover how to enable OVS-DPDK in Airship and demonstrate the end-to-end deployment flow of OVS-DPDK in Airship. Moreover, the speakers will present the implementation details like creating DPDK-enabled docker images for OVS, handling hugepage allocation for DPDK in OpenStack-Helm and Kubernetes, CPU pinning, and more.

Join the global community, November 4-6 in Shanghai for these sessions and more that can help you create a strategy to solve your organization’s edge computing needs.

The post Must-see 5G and edge computing sessions at the Open Infrastructure Summit Shanghai appeared first on Superuser.

by Allison Price at September 09, 2019 02:45 PM

CERN Tech Blog

Software RAID support in OpenStack Ironic

The vast majority of the ~15’000 physical servers in the CERN IT data centers rely on Software RAID to protect services from disk failures. With the advent of OpenStack Ironic to manage this bare metal fleet, the CERN cloud team started to work with the upstream community on adding Software RAID support to Ironic’s feature list. Software RAID support in Ironic is now ready to be released with OpenStack’s Train release, but the code, backported to Stein, is already in production on more than 1’000 nodes at CERN.

by CERN (techblog-contact@cern.ch) at September 09, 2019 10:15 AM

September 08, 2019

Christopher Smart

Setting up a monitoring host with Prometheus, InfluxDB and Grafana

Prometheus and InfluxDB are powerful time series database monitoring solutions, both of which are natively supported by the graphing tool Grafana.

Setting up these simple but powerful open source tools gives you a great base for monitoring and visualising your systems. We can use agents like node-exporter to publish metrics on remote hosts which Prometheus will scrape, and other tools like collectd which can send metrics to InfluxDB’s collectd listener (more on that later!).

Prometheus’ node exporter metrics in Grafana

I’m using CentOS 7 on a virtual machine, but this should be similar to other systems.

Install Prometheus

Prometheus is the trickiest to install, as there is no Yum repo available. You can either download the pre-compiled binary or run it in a container, I’ll do the latter.

Install Docker and pull the image (I’ll use Quay instead of Dockerhub).

sudo yum install docker
sudo systemctl start docker
sudo systemctl enable docker
sudo docker pull quay.io/prometheus/prometheus

Create a basic configuration file for Prometheus which we will pass into the container. This is also where we configure clients for Prometheus to pull data from, so let’s add a localhost target for the monitor node itself.

cat << EOF | sudo tee /etc/prometheus.yml
global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
    - targets:
      - localhost:9100
EOF

Now we can start a persistent container. We’ll pass in the config file we created earlier but also a dedicated volume so that the database is persistent across updates. We use host networking so that Prometheus can talk to localhost to monitor itself (not required if you want to configure Prometheus to talk to the host’s external IP instead of localhost).

sudo docker run -dit \
--network host \
--name prometheus \
--restart always \
-p 9090:9090 \
--volume prometheus:/prometheus \
-v /etc/prometheus.yml:/etc/prometheus/prometheus.yml:Z \
quay.io/prometheus/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--web.enable-lifecycle \
--web.enable-admin-api

Check that the container is running properly, it should say that it is ready to receive web requests in the log. You should also be able to browse to the endpoint on port 9090 (you can run queries here, but we’ll use Grafana).

sudo docker ps
sudo docker logs prometheus

Updating Prometheus config

Updating and reloading the config is easy, just edit /etc/prometheus.yml and send a message to Prometheus to reload (this was enabled by web.enable-lifecycle option). This is useful when adding new nodes to scrape metrics from.

curl -s -XPOST localhost:9090/-/reload

In the container log (as above) you should see that it has reloaded the config.

Installing Prometheus node exporter

You’ll notice in the Prometheus configuration above we have a job called node and a target for localhost:9100. This is a simple way to start monitoring the monitor node itself! Installing the node exporter in a container is not recommended, so we’ll use the Copr repo and install with Yum.

sudo curl -Lo /etc/yum.repos.d/_copr_ibotty-prometheus-exporters.repo \
https://copr.fedorainfracloud.org/coprs/ibotty/prometheus-exporters/repo/epel-7/ibotty-prometheus-exporters-epel-7.repo

sudo yum install node_exporter
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

It should be listening on port 9100 and Prometheus should start getting metrics from http://localhost:9100/metrics automatically (we’ll see them later with Grafana).
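To quickly confirm the exporter is serving data, fetch a few lines of metrics by hand:

curl -s http://localhost:9100/metrics | head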

Install InfluxDB

Influxdata provides a yum repository so installation is easy!

cat << \EOF | sudo tee /etc/yum.repos.d/influxdb.repo
[influxdb]
name=InfluxDB
baseurl=https://repos.influxdata.com/centos/$releasever/$basearch/stable
enabled=1
gpgcheck=1
gpgkey=https://repos.influxdata.com/influxdb.key
EOF
sudo yum install influxdb

The defaults are fine, other than enabling collectd support so that other clients can send metrics to InfluxDB. I’ll show you how to use this in another blog post soon.

sudo sed -i 's/^\[\[collectd\]\]/#\[\[collectd\]\]/' /etc/influxdb/influxdb.conf
cat << EOF | sudo tee -a /etc/influxdb/influxdb.conf
[[collectd]]
  enabled = true
  bind-address = ":25826"
  database = "collectd"
  retention-policy = ""
  typesdb = "/usr/local/share/collectd"
  security-level = "none"
EOF

This should open a number of ports, including InfluxDB itself on TCP port 8086 and collectd receiver on UDP port 25826.

sudo ss -ltunp |egrep "8086|25826"

Create InfluxDB collectd database

Finally, we need to connect to InfluxDB and create the collectd database. Just run the influx command.

influx

And at the prompt, create the database and exit.

CREATE DATABASE collectd
exit
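Alternatively, the same can be done non-interactively (and verified) from the shell:

influx -execute 'CREATE DATABASE collectd'
influx -execute 'SHOW DATABASES'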

Install Grafana

Grafana has a Yum repository so it’s also pretty trivial to install.

cat << EOF | sudo tee /etc/yum.repos.d/grafana.repo
[grafana]
name=Grafana
baseurl=https://packages.grafana.com/oss/rpm
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
EOF
sudo yum install grafana

Grafana pretty much works out of the box and can be configured via the web interface, so simply start and enable it. The server listens on port 3000 and the default username is admin with password admin.

sudo systemctl start grafana
sudo systemctl enable grafana
sudo ss -ltnp |grep 3000

Now you’re ready to log into Grafana!

Configuring Grafana

Browse to the IP of your monitoring host on port 3000 and log into Grafana.

Now we can add our two data sources. First, Prometheus, pointing to localhost on port 9090…

…and then InfluxDB, pointing to localhost on port 8086 and to the collectd database.

Adding a Grafana dashboard

Make sure both data sources test OK and we’re well on our way. Next we just need to create some dashboards, so let’s import a dashboard for the node exporter and we’ll hopefully at least see the monitoring host itself.

Go to Dashboards and hit import.

Type the number 1860 in the dashboard field and hit load.

This should automatically download and load the dashboard; all you need to do is select your Prometheus data source from the Prometheus drop down and hit Import!

Next you should see the dashboard with metrics from your monitor node.

So there you go, you’re on your way to monitoring all the things! For anything that supports collectd, you can forward metrics to UDP port 25826 on your monitor node. More on that later…

by Chris at September 08, 2019 12:18 PM

September 06, 2019

Nate Johnston

Joining the TC

While my candidate statement goes in to some detail about why I wanted to run for the OpenStack Technical Committee (“TC”), I wanted to write a bit more about it to explain where I am coming from and what I feel I can offer. As far as the TC is concerned, I am “new blood”. I have worked in some positions that are a part of OpenStack community stewardship before - in 2016 I spent a season as one of the election officials, and I have worked previously as an infrastructure liaison from the Neutron community.

September 06, 2019 03:26 PM

Chris Dent

Placement Update 19-35

Let's have a placement update 19-35. Feature freeze is this week. We have a feature in progress (consumer types, see below) but it is not critical.

Most Important

Three main things we should probably concern ourselves with in the immediate future:

  • We are currently without a PTL for Ussuri. There's some discussion about the options for dealing with this in an email thread. If you have ideas (or want to put yourself forward), please share.

  • We need to work on useful documentation for the features developed this cycle.

  • We need to create some cycle highlights. To help with that I've started an etherpad. If I've forgotten anything, please make additions.

What's Changed

  • osc-placement 1.7.0 has been released. This adds support for managing allocation ratios via aggregates, by adding a few different commands and args for inventory manipulation.

  • Work on consumer types exposed that placement needed to be first class in grenade to make sure database migrations are run. That change has merged. Until then placement was upgraded as part of nova.

Stories/Bugs

(Numbers in () are the change since the last pupdate.)

There are 24 (-1) stories in the placement group. 0 (0) are untagged. 5 (0) are bugs. 4 (0) are cleanups. 11 (-1) are rfes. 4 (0) are docs.

If you're interested in helping out with placement, those stories are good places to look.

osc-placement

  • https://review.opendev.org/666542 Add support for multiple member_of. There's been some useful discussion about how to achieve this, and a consensus has emerged on how to get the best results.

  • --amend and --aggregate on resource provider inventory has merged and been released in 1.7.0 (see above).

Main Themes

Consumer Types

Adding a type to consumers will allow them to be grouped for various purposes, including quota accounting.

Cleanup

Cleanup is an overarching theme related to improving documentation, performance and the maintainability of the code. The changes we are making this cycle are fairly complex to use and are fairly complex to write, so it is good that we're going to have plenty of time to clean and clarify all these things.

Performance related explorations continue:

One outcome of the performance work needs to be something like a Deployment Considerations document to help people choose how to tweak their placement deployment to match their needs. The simple answer is use more web servers and more database servers, but that's often very wasteful.

Other Placement

Miscellaneous changes can be found in the usual place.

  • https://review.opendev.org/676982 Merge request log and request id middlewares is worth attention. It makes sure that all log messages from a single request use global and local request ids.

There are three os-traits changes being discussed. And zero os-resource-classes changes.

Other Service Users

This week (because of feature freeze) I will not be adding new finds to the list, just updating what was already on the list.

End

🐎

by Chris Dent at September 06, 2019 10:53 AM

September 05, 2019

Mirantis

OpenStack vs AWS Total Cost of Ownership: Assumptions behind the TCO Calculator

You may think you know whether OpenStack or AWS is more expensive, but it's a complicated process to decide. Here are some things you need to consider.

by Nick Chase at September 05, 2019 03:22 PM

Galera Cluster by Codership

Galera Cluster with new Galera Replication Library 3.28 and MySQL 5.6.45, MySQL 5.7.27 is GA

Codership is pleased to announce a new Generally Available (GA) release of Galera Cluster for MySQL 5.6 and 5.7, consisting of MySQL-wsrep 5.6.45-25.27 and 5.7.27-25.19 with a new Galera Replication library 3.28 (release notes, download) implementing wsrep API version 25. This release incorporates all changes into MySQL 5.6.45 (release notes, download) and MySQL 5.7.27 (release notes, download) respectively.

Compared to the previous release, the Galera Replication library has a few notable fixes: it has enhanced UUID detection, and it now builds on esoteric platforms, benefiting distributions that ship Galera Cluster (such as Debian) on platforms like hppa/hurd-i386/kfreebsd. The 5.7.27-25.19 release also fixes a segmentation fault (segfault) when the wsrep_provider is set to none.

This release marks the last release for OpenSUSE 13.1 as the release itself has reached End-of-Life (EOL) status. It should also be noted that the next release will mark the EOL for OpenSUSE 13.2. If you are still using this base operating system and are unable to upgrade, please contact info@codership.com for more information.

You can get the latest release of Galera Cluster from http://www.galeracluster.com. There are package repositories for Debian, Ubuntu, CentOS, RHEL, OpenSUSE and SLES. The latest versions are also available via the FreeBSD Ports Collection.


by Colin Charles at September 05, 2019 09:20 AM

September 04, 2019

OpenStack Superuser

Tapping into Roots to Accelerate Open Infrastructure Growth in Japan

It is always a delight to be in Tokyo.  The people, the food, the tradition… if times were different, I would seriously consider becoming an expat.  More than that, Tokyo has a vibrant and involved OpenStack community that never ceases to invigorate my feelings around open source.  You can see it in every aspect of Cloud Native Days Japan, where the concept of open infrastructure comes to life through demos and talks from companies like NEC, NTT, Red Hat, Cyber Agent, Yahoo! Japan, and Fujitsu.

This is the second event to merge OpenStack Day Japan with the CNCF’s Cloud Native Day, and the growth of the event speaks to its success.  Our host and friend Akihiro Hasegawa kicked things off, noting that over 1,600 attendees made for a standing-room-only keynote from:

  • Mark Collier / Chief Operating Officer, OSF
  • Noriaki Fukuyasu / VP of Japan Operations, The Linux Foundation
  • Melanie Cebula / Software Engineer, Airbnb
  • Stephan Fabel / Director of Product, Canonical Ltd.
  • Doug David / STSM, Offering Manager Knative, IBM

The key takeaway? Open infrastructure plays a critical role for companies working in the cloud, and will continue to do so.  OSF’s Mark Collier reinforced the message brought forth at the Open Infrastructure Summit Denver. “Collaboration without boundaries works, proving that it is one of the best ways to produce software,” he said.  OpenStack has had 19 on-time releases and is one of the top three most active open source projects. He cited “no better example than CERN,” which runs Kubernetes on top of OpenStack to create one of the largest open source clouds in the world.

Stephan Fabel followed up with a timely and related talk, 10 New Rules of Open Infrastructure. Rule #1:  “Consume unmodified upstream”. “The whole point of open infrastructure is to be able to engage with the larger community for support and to create a common basis for hiring, training and innovating on your next-generation infrastructure platform.”  His talk was a strong follow-up to Collier’s, with a clear message around the power of open source and open infrastructure for Canonical’s customers.

Beyond the keynotes, there were packed rooms for talks on Kata Containers, Zuul CI, the future of the OpenStack community, and a deep dive into OpenStack in the financial sector from Y Jay FX.  OpenStack was well represented, along with some in-depth content around Kubernetes from CNCF.

On Monday evening, participants were treated to an OpenStack Foundation birthday celebration in the marketplace (one of 25 in the world!), along with a raffle of prizes from sponsors and local businesses.  It was an amazing event in a style befitting Tokyo.  We’re extremely grateful to the OpenStack Japan User Group and Akihiro Hasegawa in particular for their continued support and efforts towards the open infrastructure community.

Want to collaborate with the global open infrastructure community? Check out the upcoming Open Infrastructure Summit Shanghai happening November 4-6 or check out the OSF events page for upcoming local meetups and Open Infrastructure Days near you!

The post Tapping into Roots to Accelerate Open Infrastructure Growth in Japan appeared first on Superuser.

by Jimmy McArthur at September 04, 2019 01:00 PM

September 03, 2019

OpenStack Superuser

Analysis of Kubernetes and OpenStack Combination for Modern Data Centers

For many telecom service providers and enterprises who are transforming their data center to modern infrastructure, moving to containerized workloads has become a priority. However, vendors often do not choose to shift completely to a containerized model. 

Data centers still have to support virtual machines (VMs) to keep legacy workloads running. Therefore, a model of managing virtual machines with OpenStack and containers with Kubernetes has become popular. An OpenStack user survey conducted in 2018 found that 61% of OpenStack deployments also work with Kubernetes.

Apart from this, some of the recent tie-ups and releases of platforms clearly show this trend. For example:

  • AT&T’s three-year deal with Mirantis to develop a 5G core backed by Kubernetes and OpenStack,
  • Platform9’s Managed OpenStack and Kubernetes – providing the required feature sets bundled in a solution stack for service providers as well as developers. They also support Kubernetes on the VMware platform.
  • Nokia’s CloudBand release – containing Kubernetes and OpenStack for workload orchestration
  • The OpenStack Foundation’s recently announced Airship project, which brings the power of OpenStack and Kubernetes together in one framework.

The core part of a telecom network or any virtualized core of a data center has undergone a revolution, shifting from Physical Network Functions (PNFs) to Virtual Network Functions (VNFs). Organizations are now adopting Cloud-Native Network Functions (CNFs) to help bring CI/CD-driven agility into the picture. 

The journey is shown in one of the slides from the Telecom User Group session at KubeCon Barcelona in May 2019, which was delivered by Dan Kohn, the executive director of CNCF, and Cheryl Hung, the director of ecosystem at CNCF.

Figure – PNFs to VNFs

Image source: https://kccnceu19.sched.com/event/MSzj/intro-deep-dive-bof-telecom-user-group-and-cloud-native-network-functions-cnf-testbed-taylor-carpenter-vulk-coop-cheryl-hung-dan-kohn-cncf

According to the slide, application workloads deployed in virtual machines (VNFs) and containers (CNFs) can presently be managed with OpenStack and Kubernetes, respectively, on top of bare metal or any cloud. ONAP, shown as optional, is a containerized MANO framework that is managed with Kubernetes.

As discussed in the Telecom User Group birds-of-a-feather (BoF) session delivered by Kohn, as Kubernetes progresses with the cloud-native movement, CNFs are expected to become a key workload type. Kubernetes will be used to orchestrate CNFs as well as VNFs, with VNFs segregated using KubeVirt, Virtlet or OpenStack running on top of Kubernetes.
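
As a rough sketch of the KubeVirt option, a legacy VNF can be declared as a VirtualMachine custom resource and created through the Kubernetes API. The manifest below is illustrative only; field names vary between KubeVirt versions and the image is a placeholder.

```python
from kubernetes import client, config

# Illustrative only: fields roughly follow the KubeVirt v1alpha3 schema and
# may differ in your KubeVirt release; the container disk image is made up.
vm_manifest = {
    "apiVersion": "kubevirt.io/v1alpha3",
    "kind": "VirtualMachine",
    "metadata": {"name": "legacy-vnf"},
    "spec": {
        "running": True,
        "template": {
            "spec": {
                "domain": {
                    "devices": {"disks": [{"name": "rootdisk",
                                           "disk": {"bus": "virtio"}}]},
                    "resources": {"requests": {"memory": "2Gi"}},
                },
                "volumes": [{"name": "rootdisk",
                             "containerDisk": {"image": "example.org/vnf:latest"}}],
            }
        },
    },
}

# Create the VirtualMachine as a custom object in the "default" namespace.
config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubevirt.io",
    version="v1alpha3",
    namespace="default",
    plural="virtualmachines",
    body=vm_manifest,
)
```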

Approaches for managing workloads using Kubernetes and OpenStack

Let’s understand the approaches of integrating Kubernetes with OpenStack for managing containers and VMs.

The first approach is a basic one wherein Kubernetes co-exists with OpenStack to manage containers. It gives good performance, but you cannot manage unified infrastructure resources through a single pane of glass. This causes problems with planning and devising policies across workloads, and it can make it difficult to diagnose problems affecting the performance of resources in operations.

The second approach is to run a Kubernetes cluster in VMs managed by OpenStack. This enables OpenStack-based infrastructure to leverage the benefits of Kubernetes within a centrally managed OpenStack control system, and it allows full-featured multi-tenancy and the security benefits of an OpenStack environment for containers. However, it introduces some performance overhead and necessitates additional workflows to manage the VMs that host Kubernetes (see the sketch that follows).
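
A minimal sketch of the second approach, using openstacksdk to boot the VMs that will host the Kubernetes nodes. All names and the cloud entry are placeholders, and in practice the Kubernetes bootstrap itself would usually be handled by Magnum or an installer such as kubeadm rather than by hand.

```python
import openstack

# Assumes a cloud named "mycloud" in clouds.yaml; all names below are
# placeholders for illustration.
conn = openstack.connect(cloud="mycloud")

image = conn.compute.find_image("ubuntu-18.04")
flavor = conn.compute.find_flavor("m1.large")
network = conn.network.find_network("k8s-net")

# Boot three VMs that will become Kubernetes nodes (one control plane,
# two workers).  Bootstrapping Kubernetes itself (kubeadm, Magnum, etc.)
# happens afterwards and is not shown here.
for name in ("k8s-master-0", "k8s-worker-0", "k8s-worker-1"):
    server = conn.compute.create_server(
        name=name,
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
        key_name="my-keypair",
    )
    conn.compute.wait_for_server(server)
    print(f"{name} is ACTIVE")
```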

The third approach is an innovative one, leaning towards a completely cloud-native environment. In this approach, OpenStack is replaced by Kubernetes, which manages containers and VMs alike. Workloads can take full advantage of hardware accelerators, SmartNICs and so on. With this, it is possible to offer integrated network service solutions alongside container workloads for any data center, but it demands improved networking capabilities of the kind found in OpenStack (SFC, provider networks, segmentation).

Kubernetes vs OpenStack. Is it true?

If you look at the schedule for the upcoming VMworld US 2019, it is clear that Kubernetes will be everywhere. There will be 66 sessions and some hands-on trainings focused solely on Kubernetes integration in every aspect of IT infrastructure.

But is that the end of OpenStack? No. As we have already seen, the combination of both systems is a better bet for any organization that wants to stick with traditional workloads while gradually moving to a new container-based environment.

How are Kubernetes and OpenStack going to combine?

I came across a very decent LinkedIn post by Michiel Manten. He noted that both containers and VMs have downsides, and that each has its own use cases and orchestration tools. OpenStack and Kubernetes will complement each other if properly combined, running some workloads in VMs to get isolation benefits within a server and others in containers. One way to achieve this combination is to run Kubernetes clusters within VMs on OpenStack, which mitigates the security pitfalls of containers while leveraging the reliability and resiliency of VMs.

What are the benefits?

  • Combining the two systems immediately benefits all current workloads, so enterprises can start their modernization journey quickly and at a much lower cost than with commercial solutions.
  • Kubernetes and OpenStack can be an ideal and flexible solution for any form of cloud, including new far-edge clouds where automated deployment, orchestration and latency are the main concerns.
  • All workloads sit in a single network within a single IT ecosystem. This makes it easier to apply high-level network and security policies.
  • OpenStack supports most enterprise storage and networking systems in use today. Running Kubernetes with and on top of OpenStack enables seamless integration of containers into your IT infrastructure. Whether you want to run containerized applications on bare metal or in VMs, OpenStack allows you to run containers the way that best suits your business.
  • Kubernetes has self-healing capabilities for infrastructure. Integrated with OpenStack, it enables easier management and resiliency against failures of core services and compute nodes.
  • The recent 19th release of OpenStack (Stein) has several enhancements to support Kubernetes in the stack. The team behind the OpenStack certified Kubernetes installer made it possible to deploy all containers in a cluster within five minutes regardless of the number of nodes, down from the previous 10-12 minutes, so a very large-scale Kubernetes environment can now be launched in about five minutes.

Telecom service providers who have taken steps towards 5G agree that a cloud-native core is imperative for a 5G network. OpenStack and Kubernetes are mature open source operating and orchestration frameworks today: Kubernetes brings agility to the data center, while OpenStack has several successful projects focused on the storage and networking of workloads and supports a myriad of applications.

About the author

Sagar Nangare is a technology blogger, focusing on data center technologies (networking, telecom, cloud, storage) and emerging domains like edge computing, IoT, machine learning and AI. He works at Calsoft Inc. as a digital strategist.

Photo // CC BY NC

The post Analysis of Kubernetes and OpenStack Combination for Modern Data Centers appeared first on Superuser.

by Sagar Nangare at September 03, 2019 01:00 PM

August 31, 2019

Ghanshyam Mann

OpenStack CI/CD migration from Ubuntu Xenial -> Bionic (Ubuntu LTS 18.04)

Ubuntu Bionic (Ubuntu 18.04 LTS) was released on April 26, 2018, but OpenStack CI/CD and all of its gate jobs were still running on Ubuntu Xenial. We had to migrate OpenStack to Ubuntu Bionic and make sure all our gate jobs were tested on the Ubuntu Bionic image.

Jens Harbott (frickler) and I started this task in December 2018, in two phases: phase 1 to migrate the Zuul v3 DevStack-based jobs, and phase 2 to migrate the legacy Zuul v2 jobs.

    What is the meaning of migration:

OpenStack CI/CD is implemented with Zuul: jobs prepare the node, deploy OpenStack on it using DevStack, and run tests (Tempest or its plugins, project in-tree tests, Rally tests, etc.). The base OS installed on the node is where OpenStack gets deployed by DevStack.

Until the OpenStack Rocky release, the base OS on the node was Ubuntu Xenial. So DevStack would deploy OpenStack on Ubuntu Xenial and then run tests to make sure every project governed by OpenStack worked properly on Ubuntu Xenial.

With the move to Ubuntu Bionic, the node base OS has been switched from Ubuntu Xenial to Ubuntu Bionic. In the same way, OpenStack CI/CD is now verified on Ubuntu Bionic: on every code change, it makes sure OpenStack works properly on Ubuntu Bionic.

    Goal:

The end goal is to migrate all OpenStack projects' gate job testing from Ubuntu Xenial to Ubuntu Bionic and make sure OpenStack works fine on Bionic.

As OpenStack also supports stable branches, all jobs running on gates up to OpenStack Rocky stay on Ubuntu Xenial, and all jobs running from OpenStack Stein onwards are on Ubuntu Bionic.

    Phase1:

This phase started with the DevStack-based Zuul v3 jobs:

  • https://etherpad.openstack.org/p/devstack-bionic

The devstack base job nodeset was switched to Bionic, and before merging the devstack patch we ran DNM (Do Not Merge) testing patches on each project to make sure we did not break anyone. A few projects hit issues and fixed them before the devstack patch merged.

You can check more details on the mailing list.

    Phase2:

After finishing the devstack Zuul v3 native jobs, we had to move all legacy jobs to Bionic as well. Most projects were still using legacy jobs, and per the Stein PTI we needed to move to the next Python versions (py3.6, and py3.7 in Train).

Work is tracked in https://etherpad.openstack.org/p/legacy-job-bionic

The legacy-base and legacy-dsvm-base job nodesets were moved to Bionic, and before that change merged I pushed DNM testing patches for all the projects.

Step 1. Push a testing DNM patch on your project-owned repos with:

– Depends-On: https://review.openstack.org/#/c/641886/

– Remove or change the nodeset to Bionic in your repo-owned legacy jobs (if any of the legacy jobs overrides the nodeset from the parent job). Example: https://review.openstack.org/#/c/639017

Step 2. If you have any legacy jobs not derived from the base jobs, you need to migrate their nodesets to Bionic as defined in https://review.openstack.org/#/c/639018/. Example: https://review.openstack.org/#/c/639361/

Step 3. If any of the jobs start failing on Bionic, either fix them before the deadline of March 13th or make the failing jobs non-voting (n-v) and fix them later.

You can check more details on the mailing list.

The diagram below gives a quick glance at the base job nodesets using Bionic.

If you want to verify the nodeset used in your Zuul jobs, you can check the hostname and label in job-output.txt.

 

    Migrate the third-party CI to Bionic:

In the same way, you can migrate your third-party CI to Bionic. If a third-party job uses the base job without overriding the ‘nodeset’, the job is automatically switched to Bionic. If the job overrides the ‘nodeset’, you can switch it to a Bionic node to test on Bionic. Third-party CI jobs are not migrated as part of the upstream migration.

    Completion Summary Report:

– Started this migration very late in Stein (in December 2018)

– Finished the migration in two parts: 1. Zuul v3 native job migration, 2. legacy job migration.

– networking-midonet Bionic jobs are non-voting (n-v) because the project is not yet ready on Bionic.

– ~50 patches merged to migrate the gate jobs

– ~60 DNM testing patches pushed before the migration happened

– We managed almost zero downtime in the gate, except for Cinder, which was down for one day.

I also sent a summary of this work to the OpenStack ML – here – with all references.

by Ghanshyam Mann at August 31, 2019 07:00 PM

August 30, 2019

Aptira

Apigee API Translation

One of the challenges we’ve faced recently involved an API translation mechanism required to perform API translations among different components’ native APIs, delivering responses as per a pre-determined set of requirements.


The Challenge

API translation between different solution components is always required for an application to run. Integrating many different software components or products can be relatively easy if they can communicate with each other through a single communication channel. Thus, having a single API gateway which can be used by all other components for seamless communication among them is always a win-win situation.

One of our customers wanted to expose services as a set of HTTP endpoints so that client application developers could make HTTP requests to these endpoints. Depending on the endpoint, the service might then return data, formatted as XML or JSON, back to the client app. Content mapping was also one of their major requirements, i.e. modifying the input data to a REST API on the fly and then extracting the desired values from the response as required.

Because the customer wanted to make their service available on the web, they wanted to make sure that all necessary steps had been taken to secure and protect their services from unauthorized access. They also wanted the service to be easily consumed by other apps and components, enabling them to change the backend service implementation without affecting the public API.


The Aptira Solution

Google Cloud’s Apigee API Platform was selected for this project due to its extensive set of features that satisfied customer requirements such as rate limiting, data translation, flexible deployment options and the API Portal.

We deployed the On-Premises version of Apigee in a Private Cloud environment and then created an Apigee proxy for API translation as per the TeleManagement Forum (TMF) standards. This proxy used several of Apigee’s inbuilt policies to translate TMF requests into Cloudify API calls, as outlined below (a simplified sketch of the equivalent calls follows the list).

  • The proxy’s PreFlow uses Extract Message policies to extract the Cloudify blueprint ID from the input JSON, which is then used by a Service Callout policy to create the Cloudify deployment
  • Once the deployment is complete, a deployment ID is extracted using an Extract Message policy
  • This is followed by another Service Callout policy to start the deployment execution
  • Once the execution has been finalised, an Assign Message policy is used to create an NBI service order as per TMF standards, using the deployment ID generated earlier
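
For illustration, here is a minimal Python sketch of the equivalent calls the proxy performs, outside Apigee. The Cloudify endpoint paths, credentials, tenant header and the shape of the incoming TMF payload are simplified assumptions, not the customer's actual configuration.

```python
import requests

CLOUDIFY = "https://cloudify.example.com/api/v3.1"   # placeholder URL
AUTH = ("admin", "secret")                           # placeholder credentials
HEADERS = {"Tenant": "default_tenant"}               # placeholder tenant


def handle_tmf_request(tmf_payload):
    """Mirror of the proxy flow: extract the blueprint id, create a Cloudify
    deployment, start the install execution, then build a TMF-style service
    order from the resulting deployment id."""
    blueprint_id = tmf_payload["blueprintId"]          # Extract Message step
    deployment_id = f"{blueprint_id}-deployment"

    # Service Callout 1: create the deployment.
    requests.put(
        f"{CLOUDIFY}/deployments/{deployment_id}",
        json={"blueprint_id": blueprint_id, "inputs": {}},
        auth=AUTH, headers=HEADERS,
    ).raise_for_status()

    # Service Callout 2: start the install workflow.
    requests.post(
        f"{CLOUDIFY}/executions",
        json={"deployment_id": deployment_id, "workflow_id": "install"},
        auth=AUTH, headers=HEADERS,
    ).raise_for_status()

    # Assign Message step: NBI service order referencing the deployment id.
    return {"serviceOrderItem": [{"id": deployment_id, "state": "inProgress"}]}
```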

The Result

Apigee has enabled us to perform API translations among the different components’ native APIs, delivering responses as per a pre-determined set of requirements.



The post Apigee API Translation appeared first on Aptira.

by Aptira at August 30, 2019 01:28 PM

Chris Dent

Placement Update 19-34

Welcome to placement update 19-34. Feature Freeze is the week of September 9th. We have features in progress in placement itself (consumer types) and osc-placement that would be great to land.

Most Important

In addition to the features above, we really need to get started on tuning up the documentation so that same_subtree and friends can be used effectively.

It is also time to start thinking about what features, if any, need to be pursued in Ussuri. If there are few, that ought to leave time and energy for getting the osc-placement plugin more up to date.

And, there are plenty of stories (see below) that need attention. Ideally we'd end every cycle with zero stories, including removing ones that no longer make sense.

What's Changed

  • Tetsuro has picked up the baton for performance and refactoring work and found some improvements that have merged. There's additional work in progress (noted below).

Stories/Bugs

(Numbers in () are the change since the last pupdate.)

There are 25 (2) stories in the placement group. 0 (0) are untagged. 5 (1) are bugs. 4 (0) are cleanups. 12 (1) are rfes. 4 (0) are docs.

If you're interested in helping out with placement, those stories are good places to look.

osc-placement

osc-placement is currently behind by 12 microversions.

  • https://review.opendev.org/666542 Add support for multiple member_of. There's been some useful discussion about how to achieve this, and a consensus has emerged on how to get the best results.

  • https://review.opendev.org/640898 Adds a new --amend option which can update resource provider inventory without requiring the user to pass a full replacement for inventory and an --aggregate option to set inventory on all the providers in an aggregate. This has been broken up into three patches to help with review. This one is very close but needs review from more people than Matt.

Main Themes

Consumer Types

Adding a type to consumers will allow them to be grouped for various purposes, including quota accounting.

I picked this up yesterday and hope to have it finished next week, barring distractions. I figure having it in place for nova for Ussuri is a nice to have.

Cleanup

Cleanup is an overarching theme related to improving documentation, performance and the maintainability of the code. The changes we are making this cycle are fairly complex to use and are fairly complex to write, so it is good that we're going to have plenty of time to clean and clarify all these things.

Performance related explorations continue:

One outcome of the performance work needs to be something like a Deployment Considerations document to help people choose how to tweak their placement deployment to match their needs. The simple answer is use more web servers and more database servers, but that's often very wasteful.

Discussions about using a different JSON serializer ended with a decision not to use orjson, because it presents some packaging and distribution issues that might be problematic. There's still an option to use one of the other alternatives, but that exploration has not started.

Other Placement

Miscellaneous changes can be found in the usual place.

  • https://review.opendev.org/676982 Merge request log and request id middlewares is worth attention. It makes sure that all log messages from a single request use a global and a local request id.

There are two os-traits changes being discussed. And zero os-resource-classes changes.

Other Service Users

New discoveries are added to the end. Merged stuff is removed. Anything that has had no activity in 4 weeks has been removed.

End

by Chris Dent at August 30, 2019 09:48 AM

August 29, 2019

Aptira

Implementing TMF APIs in Apigee

Aptira Apigee TMF APIs

A large Telco is building an Orchestration platform to orchestrate workloads that will be deployed in their Private Cloud infrastructure and spread across multiple data centers. Their internal systems must also integrate with the platform, requiring the implementation of TMF APIs.


The Challenge

The customer’s internal environment includes a large set of systems, including Product Ordering, operations support systems (OSS), business support systems (BSS) and catalog systems that are functionally North-bound systems to the Orchestration platform. To effectively integrate these systems, the platform requires implementation of North bound Interfaces (NBI) between the Orchestration platform and those systems.

The core challenge was to ensure that requests for Customer Facing Services (CFS) are translated into necessary actions on the Resource Facing Services (RFS) or Network service instances. These instances are managed by Cloudify in the NFVi Domain to implement the required service changes.


The Aptira Solution

To support the customer’s roadmap of transforming these IT systems and to enable seamless integration of these systems with the orchestration platform, a common API framework using standard TeleManagement Forum (TMF) APIs was proposed.

TMF defines a set of guidelines standardizing the way IT systems interact to manage and deliver complex services such as Customer Facing Services (CFS) and Resource Facing Services (RFS). It defines a suite of APIs (based on the OpenAPI standard) that specifies the interface and data model for each type of function invoked by IT systems. Examples include Service Order and Activation, Service Catalog, and Service Qualification.

Aptira’s design for the North Bound Interface (NBI) function was to use Google Cloud’s Apigee API Platform as the implementation platform, and to make use of its various configuration and customization capabilities. Apigee provides extensive API mapping and logging functions that they call policies. These policies can then be applied to each API call that transits through the platform. In addition to this, there are multiple options for more extensive customisation through code development.

To handle the TMF API requests from the NBI systems in the Orchestration platform, APIs were implemented in Apigee in the following stages, for each TMF-specified function call that needs to be handled from the northbound systems:

Stage 1

  • An API proxy endpoint is created in Apigee using the Apigee administration portal
  • Security policies to enable communication between Apigee and NBI systems are configured
  • A business workflow is designed by defining actions that need to be triggered based on the TMF APIs data model
  • Using Apigee’s language constructs available in the developer portal, a policy is defined that extracts the parameters from the payload of the requests
  • Each API’s data model is interpreted in a specific way, creating a workflow policy

For example: to initiate a Heal operation on a network service, the IT systems trigger TMF API 664 Resource Activation and Configuration, specifying the Heal action. The request payload has the following parameters (a minimal example of such a payload is sketched after the list):

  • resourceFunction – Identifier of Network service to be healed
  • healPolicy – A set of custom scripts/policy to trigger healing
  • plus additional parameters
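
For illustration only, a minimal heal request of this shape posted to the Apigee proxy might look like the following Python sketch. The proxy URL and the exact TMF 664 field layout are assumptions; the real schema is defined by the TMF specification and the customer's workflow.

```python
import requests

# Placeholder proxy endpoint exposed by Apigee; the real path and the
# precise TMF 664 schema will differ.
PROXY_URL = "https://api.example-telco.com/tmf-api/resourceActivation/v1/heal"

heal_request = {
    "resourceFunction": {"id": "vfw-network-service-001"},  # service to heal
    "healPolicy": "restart-failed-vnfcs",                   # custom policy name
    # "plus additional parameters" from the spec would go here.
}

resp = requests.post(PROXY_URL, json=heal_request, timeout=30)
resp.raise_for_status()
print(resp.json())
```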

Stage 2

  • Once the TMF API payload has been extracted and its parameters interpreted (based on the workflow defined in Stage 1), actions are triggered by making an API call. Apigee adds the network identifier parameters extracted in Stage 1 to the API payload sent to the southbound systems.
  • Each NBI API call is converted into a Southbound Interface API call to the relevant system that will implement the request. In this case, the main southbound system was the Cloudify orchestrator.

For example: the network identifier from Stage 1 identifies the exact deployment instance in the underlying infrastructure environment. Apigee invokes a Cloudify API call to trigger the specific action, such as Heal, on the network resource.

Since all the operations are designed to be asynchronous, Apigee maintains a transaction state and waits for a response from Cloudify indicating completion of the heal action. Once the heal action is completed, a response is sent back to the originating northbound API caller using the same API proxy endpoint.


The Result

Utilising Apigee’s API translation mechanism, we were able to demonstrate all the customer use cases that involve NBI system integration with the platform.

The following TMF APIs were implemented:

  • TMF 641 Service Activation and Ordering
  • TMF 633 Service Catalog
  • TMF 664 Resource Function Activation and Configuration

As a result of this exercise, API developers now have a better idea of the extensive set of constructs in Apigee with which APIs can be built to develop applications across any technology domain.



The post Implementing TMF APIs in Apigee appeared first on Aptira.

by Aptira at August 29, 2019 01:03 PM

August 28, 2019

OpenStack Superuser

A Global Celebration “For The Love of Open!”

Since July 2010, the global community has celebrated the OpenStack project’s birthday. User Groups all over the world have hosted celebrations around July – August, presenting slide decks, eating cupcakes, and spending time with local community members to commemorate this milestone.

Now that the OpenStack Foundation family has grown with the addition of Zuul, Kata, StarlingX, and Airship, we’ve timed the annual celebration around the establishment of OSF, which was in July 2012, and invited the entire OSF family to celebrate! This year, we’re celebrating quite a few milestones:

  • 105,000 members in 187 countries from 675 organizations, making OSF one of the largest global open source foundations in the world
  • OpenStack is one of the top 3 most active open source projects, with over 30 OpenStack public cloud providers around the world and adoption by global users including Verizon, Walmart, Comcast, Tencent and hundreds more
  • Kata Containers is supported by infrastructure donors including Google Cloud, Microsoft, Vexxhost, AWS, PackageCloud, and Packet.com
  • Zuul adoption has accelerated with speaking sessions and case studies by BMW, leboncoin, GoDaddy, the OpenStack Foundation and more
  • StarlingX adoption by China UnionPay, a leading financial institution in China, who will be sharing their use case at the Shanghai Summit
  • Airship elected Technical and Working Committees, and has received ecosystem support from companies including AT&T, Mirantis and SUSE

Photo: Korea User Group

25 User Groups from 20 different countries all over the world celebrated (full list below).
Missed your local group’s celebration? Stay in touch and join our OSF Meetup network!

China Open Infrastructure
DRCongo OpenStack Group
Indonesia OpenStack User Group
Japan OpenStack User Group
Korea User Group
OpenInfra Lower Saxony
Open Infrastructure LA
Open Infrastructure Mexico City
Open Infrastructure San Diego
OpenStack Austin, Texas
OpenStack Bangladesh
OpenStack Benin User Group
OpenStack Bucharest, Romania Meetup
OpenStack Côte d’Ivoire
OpenStack Ghana User Group
OpenStack Guatemala User Group
OpenStack Malaysia User Group
OpenStack Meetup Group & SF Bay Cloud Native Containers
OpenStack Nigeria User Group
OpenStack Thailand User Group
OpenStack & OpenInfra Portland, Oregon
OpenStack & OpenInfra Russia Moscow
Tunisia OpenStack User Group
Vietnam User Group
Virginia OpenStack User Group

Photo: Indonesia OpenStack User Group

The User Groups gathered in a variety of sizes, with the largest attracting 200 attendees in Indonesia. Community members gave presentations, handed out awards, printed t-shirts and stickers, and sang birthday songs. From the pictures and feedback received, everyone thoroughly enjoyed their celebrations.

Thank you to all the organizers of the User Groups for bringing your local communities together. To see how other User Groups celebrated, check out the pictures on twitter and flickr.

If you’re looking to join the next meetup in your area, connect with your local User Group here.

Be sure to read up on other blogs and articles written by the Bangladesh and China User Groups.

The post A Global Celebration “For The Love of Open!” appeared first on Superuser.

by Ashlee Ferguson and Ashleigh Gregory at August 28, 2019 03:54 PM

Aptira

Apigee Central Logging

Aptira Apigee Central Logging

Completing a full-stack Private Cloud evaluation is no mean feat. Central logging – capturing and correlating information from all components, trapping and logging all required data, all without a significant amount of custom development – this is where innovative ideas are made.


The Challenge

One of the key success criteria for this customer’s full-stack Private Cloud evaluation was the detailed instrumentation of all components, providing fine-grained visibility into the interworking of all components in the solution for each test use case. This required capturing all API calls, the data that was passed over each call, and the resulting responses. There were as many as 10 external integration points and multiple inter-component interfaces across which logs had to be captured and then correlated.

Aptira needed to identify a mechanism to trap these API calls and to log the required data.

The components used in this evaluation included OpenStack, Cloudify, the OpenDaylight SDN Controller and the TICK Stack. Although each component had its own logging mechanism, capturing the data flow between these different components with a single logging mechanism was difficult. It looked like a significant amount of custom development would be required.


The Aptira Solution

Aptira’s Solutionauts came up with an innovative idea: use the API Gateway component of the solution to implement this central logging capability. This approach would remove the need for significant custom development and avoid the introduction of tools that were only used for the evaluation and had no place in a production environment.

The API gateway used in the solution was the on-premises version of Google Cloud’s Apigee API Platform, so we had all the capability we needed to implement this idea.

Apigee was already configured to manage the external APIs, and we were able to configure Apigee to manage the integration points between multiple interconnected components. Multiple Apigee proxies were created and deployed at all the integration points across the solution. The native APIs of all the components were integrated as backend service endpoints for the APIs managed by Apigee. Apart from using Apigee’s standard functionality – API rate limiting, API translation – we extended Apigee’s logging policy capabilities into a central logging mechanism. This enabled us to capture all the required interface logs across all the components, which we then used for monitoring and performance analysis.

The power and capability of Apigee provided all the features we needed to implement the desired central logging functionality. All the required data was captured by invoking Apigee’s REST APIs with no involvement of 3rd party custom interfaces.


The Result

Once fully implemented, this central logging mechanism operated smoothly in parallel with the functional API calls that occurred while the system was operating and performing the evaluation use cases, and we were able to successfully verify the operation of all use case functions.



The post Apigee Central Logging appeared first on Aptira.

by Aptira at August 28, 2019 01:06 PM

August 27, 2019

Aptira

Apigee: On-Prem Vs SaaS

A large APAC Telco is building an Orchestration platform to orchestrate workloads that will be deployed in their Private Cloud infrastructure and spread across multiple data centers. Apigee can make this large project relatively simple – but which version is better suited? On-Premises or SaaS?


The Challenge

In order to efficiently orchestrate such large workloads, the customer requested a common API layer to control and manage traffic between multiple systems and the Orchestration platform. These systems include Operations Support Systems (OSS), Business Support Systems (BSS), Analytics, Product Ordering systems and a WAN Controller. They would also like to expose certain APIs to external partners via a web-based API Portal, and they have a long list of feature requirements, including rate limiting, data translation, flexible deployment options and the API Portal.


The Aptira Solution

We selected Apigee for this project due to its extensive set of features that satisfied the customer requirements. Aptira designed a deployment architecture for Apigee, taking into consideration the volume of API traffic from the many integration points, tenancy, security and networking.

Out of the box, Apigee supports two types of deployment – Software-as-a-Service (SaaS) and the On-Premises version. The SaaS version satisfied most of the customer requirements and reduced total cost of ownership. However, the design had some major complexities which needed to be addressed.

The first complexity is platform integration over their corporate network: the data traffic from the SaaS instance to the orchestration platform had to be sent over a secure VPN tunnel, which, depending on the customer’s environment, may traverse multiple systems and hops. This would have a significant impact on API response time.

Secondly, the customer has defined a set of regulatory compliance requirements to be validated for the whole orchestration platform. These requirements are often driven by government organizations that host their workloads on the platform. Such workloads often require software systems to be integrated with customer systems (hardware equipment or software) that are hosted within the customer’s environment. This kind of integration is easier to manage in an On-Premises version, where deployments can be customized using third-party components. SaaS versions, by contrast, are designed around standard security mechanisms, so compliance with these requirements would require customization of the software components. The problem magnifies if multi-tenant workloads are to be hosted, which would increase the customization effort and in turn introduce dependencies on the vendor and the software’s release cycle.

To overcome these two major complexities, Aptira decided to use the On-Premises version of Apigee. The On-Premises version includes an automated mechanism to deploy its sub-systems. This provided control over the infrastructure resources on which they are deployed and allowed fine-tuning of the resources hosting those sub-systems according to the API traffic needs.

Apigee’s automated deployment mechanism provided complete control over its deployment and the configuration of its sub-systems. It is relatively easy to make customizations to the software components should any new requirements arise, since no vendor involvement is needed. It is also easier to integrate with co-located systems, since data transfer over the internal network is much faster, thereby reducing API response time.

The benefits of the on-prem deployment of Apigee are balanced against some additional considerations that are absent in the SaaS version, for example operations and maintenance, resource allocation and validation. However, the customer had a strong preference for the On-Premises version, as they had already completed an independent assessment of the technology against their requirements, so we could assume that they had already accepted these overheads.

From an integration point of view, we integrated Apigee with Orchestration specific platform systems and the customers environment systems:

  • Cloudify: Service Orchestrator/NFVO
  • TICKStack: event management Analytics engine
  • WAN SDN controller
  • OSS/BSS (Simulated using POSTMAN)

For each integration point, an API proxy endpoint was created, taking into consideration the security policies that each API endpoint requires. With automation tools in place, it is easier to maintain the software and handle operations such as upgrades and disaster recovery. Also, with proper capacity planning and budgeting, most of the additional considerations can be adequately handled.

As this project was relatively new for the customer, their team came onboard quickly, seamlessly integrated with Aptira’s project collaboration processes, and addressed each requirement in the solution space, thereby helping us resolve queries faster during the design phase. It is also worth noting that the support Aptira received from Apigee staff was extremely beneficial in delivering the required outcome in a timely fashion for this solution.


The Result

Aptira designed the On-Prem Apigee deployment to meet all customer requirements, taking into account all the considerations mentioned above. The design not only enabled seamless integration between all systems using the API gateway mechanism but also required minimal changes in the customer’s environment.

Aptira implemented a full-stack solution configuration with Apigee as the system-wide API Gateway that enabled its capabilities to be validated by live execution of telco workloads.



The post Apigee: On-Prem Vs SaaS appeared first on Aptira.

by Aptira at August 27, 2019 01:43 PM

Galera Cluster by Codership

Galera Cluster hiring for Quality Assurance Engineer

Do you think Quality Assurance (QA) is more than the simplistic view of bug hunting? Do you believe that QA is important to the entire software development lifecycle and want to focus on failure testing, process and performance control, as well as best practice adoption? Do you enjoy doing performance benchmarks and noticing regressions? Do you like to write about it, from internal reports to external blog posts?

Then why not take up a challenge at Codership, the makers of Galera Cluster, as we are looking for a Galera Cluster QA Engineer (job description at link).

We’re looking for someone who is able to work remotely, join a company meeting at least once per year, be comfortable with the use of Slack and email (asynchronous communication, for developing our virtually synchronous replication solution!), but most importantly enjoy testing the application with a methodical approach. You will also get to verify bugs reported by users. And let us not forget, this job requires good knowledge of MySQL and MariaDB Server, which is where our replication layer lives.

Please send your CV to jobs@galeracluster.com. Looking forward to hearing from you!

by Colin Charles at August 27, 2019 08:45 AM

August 26, 2019

Ed Leafe

Moving On

It’s been a great run, but my days in the OpenStack world are coming to an end. As some of you know already, I have accepted an offer to work for DataRobot. I only know bits and pieces of what I will be working on there, but one thing’s for sure: it won’t be on … Continue reading "Moving On"

by ed at August 26, 2019 03:06 PM

Aptira

Apigee Service Orchestration and Integration

Aptira Apigee Service Orchestration

End-to-End (E2E) Orchestration of services is achieved when the lifecycle events of network services are managed at the infrastructure level and across multiple domains. This customer requires lifecycle orchestration of services across multiple vendor-specific OSS/BSS systems.


The Challenge

One of our Communication Service Provider (CSP) customers required E2E Orchestration of Network services across different technology domains, including integration with their external Business system. Since each OSS/BSS system is implemented with its own set of interfaces, there are significant challenges when integrating multiple systems with a common orchestration platform. For this reason, a common API layer is required to handle such patterns.


The Aptira Solution

The key components of the Orchestration platform being developed include:

  • Network Functions Virtualisation Orchestrator (NFVO)
    • The NFVO is responsible for handling the resource orchestration of network services, modelled using TOSCA, across Network Functions Virtualisation Infrastructure (NFVI) domains. The NFVO has visibility of all the southbound components and manages the lifecycle of services at the infrastructure level, i.e. at the Virtualised Infrastructure Manager (VIM) and Software Defined Networking (SDN) layers. However, it doesn’t have information across technology domains. In this customer’s platform, the NFVO is implemented using Cloudify.
  • Service Orchestrator (SO)
    • The SO handles the E2E orchestration of network services across different technology domains, such as wireline, radio, Evolved Packet Core and access networks, by integrating with product ordering systems and OSS/BSS from different vendors that support different interfaces such as REST, SOAP or proprietary interfaces. In this customer’s platform the SO is implemented using Cloudify.

The SO and NFVO communicate with each other extensively, but they also must integrate with multiple external systems. By convention, interfaces with OSS/BSS and other service-level systems are called Northbound Interfaces (NBI). Similarly, interfaces with network-level resource management systems such as network elements, WAN SDN controllers and the like are called Southbound Interfaces (SBI).

Another of the customer’s requirements is that access to some limited parts of the system functionality must be available to third parties. They wanted to expose public APIs via an API Portal and allow trusted, qualified third parties to access API documentation and sandpit environments.

This all leads to a significant number of APIs being implemented in any orchestration platform. To seamlessly integrate the orchestration platform with external systems and expose public APIs in a controlled manner, a common API management layer is required.

An API Management layer mediates between multiple integrated systems that communicate via application calls. Aptira determined that Google Cloud’s Apigee API Platform was the right product to meet these requirements.

Apigee is used as the API gateway in the orchestration platform. In addition to its standard set of features such as API rate limiting, security handling and API analytics, it comes with a rich set of language constructs to create very specialized API transforms that mediate between the APIs of different systems. The NBI in this solution implements standard TM Forum Open APIs. The core of the API layer is implemented using Apigee.

To demonstrate an end-to-end orchestration use case, we simulated an environment where Apigee is integrated with Cloudify (as the NFVO) by defining business workflows in Apigee that translate the TMF API calls from northbound systems into NFVI domain-specific orchestration API calls. For instance, orchestration of a firewall service and a core telco vIMS service was demonstrated using a single TMF Service Order and Activation API call (a simplified example of such a call is sketched below).
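
As a rough illustration, a simplified service order of this kind could be posted to the NBI as in the sketch below. The endpoint and field names only loosely follow the TMF 641 data model and are assumptions for illustration, not the actual interface.

```python
import requests

# Placeholder NBI endpoint exposed through Apigee; the exact path and the
# full TMF 641 schema are simplified for illustration.
NBI_URL = "https://api.example-telco.com/tmf-api/serviceOrdering/v1/serviceOrder"

service_order = {
    "externalId": "ORDER-0001",
    "orderItem": [
        {
            "action": "add",
            "service": {
                "serviceSpecification": {"id": "vIMS"},   # or "firewall"
                "serviceCharacteristic": [
                    {"name": "deploymentFlavour", "value": "small"},
                ],
            },
        }
    ],
}

resp = requests.post(NBI_URL, json=service_order, timeout=30)
resp.raise_for_status()
print("Service order accepted:", resp.json().get("id"))
```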

The following APIs were implemented:

  • TMF 641 Service Activation and Ordering
  • TMF 633 Service Catalog
  • TMF 664 Resource Function Activation and Configuration

The Southbound Interfaces of the Service Orchestrator are represented by the orchestration layer in the NFVi domain (i.e. the NFVO) and in the Transport domain (i.e. the WAN SDN controller). In order to demonstrate an end-to-end orchestration use case that involves the Transport domain, we set up a WAN topology using a set of OVS switches, integrating Cloudify in the SO layer with a WAN SDN Controller.

The ordering system adds details, including service-level details, SLAs and waypoints, to the TMF 641 request to set up a WAN service such as a VPN or MPLS service. Apigee, using the workflow mechanism described above, translates the API request into an orchestration request to Cloudify in the SO layer, adding the required parameters to instantiate a VPN service.


The Result

The result is an API framework that can be extended to develop other complex business use cases so that it not only orchestrates services at the infrastructure level but also makes integration with systems such as product ordering systems seamless.

The Apigee API platform was able to handle generic API management tasks and also implement deeply specialized telco requirements. This ultimately helps the customer to roll out services faster, meeting their business objectives and allowing them to rapidly adapt to changes in the future.



The post Apigee Service Orchestration and Integration appeared first on Aptira.

by Aptira at August 26, 2019 01:12 PM

August 22, 2019

Galera Cluster by Codership

Running Galera Cluster on Microsoft Azure and comparing it to their hosted services (EMEA and USA webinar)

Do you want to run Galera Cluster in the Microsoft cloud? Why not learn to set up a 3-node Galera Cluster using Microsoft Azure Compute Virtual Machines and run it yourself. In this webinar, we will cover the steps to do this, with a demonstration of how easy it is for you to do.

In addition, we will cover why you may want to run a 3-node (or more) Galera Cluster (active-active multi-master clusters) instead of (or in addition to) using Azure Database for MySQL or MariaDB. We will also cover cost comparisons. 

Join us and learn about storage options, backup & recovery, as well as monitoring & metrics options for the “roll your own Galera Cluster” in Azure.
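
Once the three nodes are up (on Azure or anywhere else), a quick sanity check is to read the wsrep status variables from any node. Below is a minimal sketch using PyMySQL, with placeholder connection details.

```python
import pymysql

# Placeholder connection details for one of the three Galera nodes.
conn = pymysql.connect(host="10.0.0.11", user="root",
                       password="secret", database="mysql")

try:
    with conn.cursor() as cur:
        cur.execute(
            "SHOW GLOBAL STATUS WHERE Variable_name IN "
            "('wsrep_cluster_size', 'wsrep_cluster_status', 'wsrep_ready')"
        )
        status = dict(cur.fetchall())
finally:
    conn.close()

# On a healthy 3-node cluster we expect size 3, status Primary, ready ON.
print(status)
assert status.get("wsrep_cluster_size") == "3"
assert status.get("wsrep_cluster_status") == "Primary"
```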

EMEA webinar 10th of September 1-2 PM CEST (Central European Time)
JOIN THE EMEA WEBINAR

USA webinar 10th of September 9-10 AM PDT (Pacific Daylight Time)
JOIN THE USA WEBINAR

Presenter: Colin Charles, Galera Cluster Chief Evangelist, Codership


by Sakari Keskitalo at August 22, 2019 11:40 AM

August 16, 2019

Chris Dent

Placement Update 19-32

Here's placement update 19-32. There will be no update 33; I'm going to take next week off. If there are Placement-related issues that need immediate attention please speak with any of Eric Fried (efried), Balazs Gibizer (gibi), or Tetsuro Nakamura (tetsuro).

Most Important

Same as last week: The main things on the Placement radar are implementing Consumer Types and cleanups, performance analysis, and documentation related to nested resource providers.

A thing we should place on the "important" list is bringing the osc placement plugin up to date. We also need to discuss what we would like the plugin to be. Is it required that it have ways to perform all the functionality of the API, or is it about providing ways to do what humans need to do with the placement API? Is there a difference?

We decided that consumer types is medium priority: The nova-side use of the functionality is not going to happen in Train, but it would be nice to have the placement-side ready when U opens. The primary person working on it, tssurya, is spread pretty thin so it might not happen unless someone else has the cycles to give it some attention.

On the documentation front, we realized during some performance work last week that it is easy to have an incorrect grasp of how same_subtree works when there are more than two groups involved. It is critical that we create good "how to use" documentation for this and other advanced placement features. Not only can it be easy to get wrong, it can be a challenge to see that you've got it wrong (the failure mode is "more results, only some of which you actually wanted").
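
As a rough illustration of why this documentation matters, here is a sketch of the kind of granular request where same_subtree comes into play. The group suffixes, traits and endpoint are made-up placeholders; only the general shape of the query is the point.

```python
import requests

PLACEMENT = "http://placement.example.com"   # placeholder endpoint
HEADERS = {
    "x-auth-token": "ADMIN_TOKEN",            # placeholder token
    # same_subtree was added in placement microversion 1.36.
    "OpenStack-API-Version": "placement 1.36",
}

# Ask for a VF from one request group and an accelerator from another,
# requiring that both come from providers under the same subtree (for
# example, the same NUMA node).  Suffixes and traits are illustrative.
params = {
    "resources_PORT": "SRIOV_NET_VF:1",
    "required_PORT": "CUSTOM_PHYSNET_PUBLIC",
    "resources_ACCEL": "FPGA:1",
    "group_policy": "none",
    "same_subtree": "_PORT,_ACCEL",
}

resp = requests.get(f"{PLACEMENT}/allocation_candidates",
                    params=params, headers=HEADERS)
resp.raise_for_status()
print(len(resp.json()["allocation_requests"]), "candidates")
```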

What's Changed

  • Yet more performance fixes are in the process of merging. Most of these are related to getting _merge_candidates and _build_provider_summaries to have less impact. The fixes are generally associated with avoiding duplicate work by generating dicts of reusable objects earlier in the request. This is possible because of the relatively new RequestWideSearchContext. In a request that returns many provider summaries _build_provider_summaries continues to have a significant impact because it has to create many objects but overall everything is much less heavyweight. More on performance in Themes, below.

  • The combination of all these performance fixes, and because of microversions, makes it reasonable for anyone running placement in a resource constrained environment (or simply wanting things to be faster) to consider running Train placement with any release of OpenStack. Obviously you should test it first, but it is worth investigating. More information on how to achieve this can be found in the upgrade to stein docs

Stories/Bugs

(Numbers in () are the change since the last pupdate.)

There are 23 (1) stories in the placement group. 0 (0) are untagged. 4 (1) are bugs. 4 (0) are cleanups. 11 (0) are rfes. 4 (0) are docs.

If you're interested in helping out with placement, those stories are good places to look.

osc-placement

osc-placement is currently behind by 12 microversions.

  • https://review.opendev.org/666542 Add support for multiple member_of. There's been some useful discussion about how to achieve this, and a consensus has emerged on how to get the best results.

  • https://review.opendev.org/640898 Adds a new '--amend' option which can update resource provider inventory without requiring the user to pass a full replacement for inventory. This has been broken up into three patches to help with review.

Main Themes

Consumer Types

Adding a type to consumers will allow them to be grouped for various purposes, including quota accounting.

As mentioned above, this is currently paused while other things take priority. If you have time that you could spend on this please respond here expressing that interest.

Cleanup

Cleanup is an overarching theme related to improving documentation, performance and the maintainability of the code. The changes we are making this cycle are fairly complex to use and are fairly complex to write, so it is good that we're going to have plenty of time to clean and clarify all these things.

As said above, there's lots of performance work in progress. We'll need to make a similar effort with regard to docs. For example, all of the coders involved in the creation and review of the same_subtree functionality struggle to explain, clearly and simply, how it will work in a variety of situations. We need to enumerate the situations and the outcomes, in documentation.

One outcome of this work will be something like a Deployment Considerations document to help people choose how to tweak their placement deployment to match their needs. The simple answer is use more web servers and more database servers, but that's often very wasteful.

On the performance front, there is one major area of impact which has not received much attention yet. When requesting allocation candidates (or resource providers) that will return many results the cost of JSON serialization is just under one quarter of the processing time. This is to be expected when the response body is 2379k big, and 154000 lines long (when pretty printed) for 7000 provider summaries and 2000 allocation requests.

But there are ways to fix it. One is to ask more focused questions (so fewer results are expected). Another is to limit=N the results (but this can lead to issues with migrations).

Another is to use a different JSON serializer. Should we do that? It makes a big difference with large result sets (which will be common in big and sparse clouds).
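
A quick, informal way to gauge what a different serializer might buy is to time the standard library against a candidate such as orjson on a large, allocation-candidates-shaped payload. The payload below is synthetic and the numbers will vary; this is a sketch, not a benchmark of placement itself.

```python
import json
import timeit

import orjson  # assumes 'pip install orjson'

# Build a fake payload sized like a big allocation candidates response:
# thousands of provider summaries, each with a few resource classes.
payload = {
    "provider_summaries": {
        f"rp-{i}": {
            "resources": {
                "VCPU": {"capacity": 64, "used": 8},
                "MEMORY_MB": {"capacity": 262144, "used": 16384},
            },
            "traits": ["HW_CPU_X86_AVX2", "STORAGE_DISK_SSD"],
        }
        for i in range(7000)
    }
}

stdlib = timeit.timeit(lambda: json.dumps(payload), number=10)
fast = timeit.timeit(lambda: orjson.dumps(payload), number=10)
print(f"stdlib json: {stdlib:.2f}s  orjson: {fast:.2f}s for 10 dumps")
```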

Other Placement

Miscellaneous changes can be found in the usual place.

There are two os-traits changes being discussed. And zero os-resource-classes changes.

Other Service Users

New discoveries are added to the end. Merged stuff is removed. Anything that has had no activity in 4 weeks has been removed.

End

Have a good next week.

by Chris Dent at August 16, 2019 02:34 PM

August 13, 2019

CERN Tech Blog

Nova support for a large Ironic deployment

CERN runs OpenStack Ironic to provision all the new hardware deliveries and the on-demand requests for baremetal instances. It has already replaced most of the workflows and tools for managing the lifecycle of physical nodes, but we continue to work with the upstream community to improve the pre-production burn-in, the up-front performance validation and the integration of retirement workflows. During the last 2 years the service has grown from 0 to ~3100 physical nodes.

by CERN (techblog-contact@cern.ch) at August 13, 2019 02:00 PM

RDO

Community Blog Round Up 13 August 2019

Making Host and OpenStack iSCSI devices play nice together by geguileo

OpenStack services assume that they are the sole owners of the iSCSI connections to the iSCSI portal-targets generated by the Cinder driver, and that is fine 98% of the time, but what happens when we also want to have other non-OpenStack iSCSI volumes from that same storage system present on boot? In OpenStack the OS-Brick […]

Read more at https://gorka.eguileor.com/host-iscsi-devices/

Service Assurance on small OpenShift Cluster by mrunge

This article is intended to give an overview on how to test the

Read more at http://www.matthias-runge.de/2019/07/09/Service-Assurance-on-ocp/

Notes on testing a tripleo-common mistral patch by JohnLikesOpenStack

I recently ran into bug 1834094 and wanted to test the proposed fix. These are my notes if I have to do this again.

Read more at http://blog.johnlikesopenstack.com/2019/07/notes-on-testing-tripleo-common-mistral.html

Developer workflow with TripleO by Emilien

In this post we’ll see how one can use TripleO for developing & testing changes into OpenStack Python-based projects (e.g. Keystone).

Read more at https://my1.fr/blog/developer-workflow-with-tripleo/

Avoid rebase hell: squashing without rebasing by OddBit

You’re working on a pull request. You’ve been working on a pull request for a while, and due to lack of sleep or inebriation you’ve been merging changes into your feature branch rather than rebasing. You now have a pull request that looks like this (I’ve marked merge commits with the text [merge]):

Read more at https://blog.oddbit.com/post/2019-06-17-avoid-rebase-hell-squashing-wi/

Git Etiquette: Commit messages and pull requests by OddBit

Always work on a branch (never commit on master) When working with an upstream codebase, always make your changes on a feature branch rather than your local master branch. This will make it easier to keep your local master branch current with respect to upstream, and can help avoid situations in which you accidentally overwrite your local changes or introduce unnecessary merge commits into your history.

Read more at https://blog.oddbit.com/post/2019-06-14-git-etiquette-commit-messages/

Running Keystone with Docker Compose by OddBit

In this article, we will look at what is necessary to run OpenStack’s Keystone service (and the requisite database server) in containers using Docker Compose.

Read more at https://blog.oddbit.com/post/2019-06-07-running-keystone-with-docker-c/

The Kubernetes in a box project by Carlos Camacho

Implementing cloud computing solutions that run in hybrid environments might be the final answer when it comes to finding the best benefit/cost ratio.

Read more at https://www.anstack.com/blog/2019/05/21/kubebox.html

Running Relax-and-Recover to save your OpenStack deployment by Carlos Camacho

ReaR (Relax-and-Recover) is a pretty impressive disaster recovery solution for Linux. It creates both a bootable rescue image and a backup of the associated files you choose.

Read more at https://www.anstack.com/blog/2019/05/20/relax-and-recover-backups.html

by Rain Leander at August 13, 2019 08:00 AM

August 12, 2019

OpenStack Superuser

Inside open infrastructure: The latest from the OpenStack Foundation

Welcome to the latest edition of the OpenStack Foundation Open Infrastructure newsletter, a digest of the latest developments and activities across open infrastructure projects, events and users. Sign up to receive the newsletter and email community@openstack.org to contribute.

Spotlight on: The Open Infrastructure Summit Shanghai Agenda

The agenda for the Open Infrastructure Summit Shanghai went live this week! Join the global community in Shanghai from November 4-6 to experience:

  • Keynote and breakout sessions spanning 30+ open source projects from technical community leaders and organizations, including:
    • Managing a growing OpenStack cloud in production at ByteDance (creator of TikTok), which runs an OpenStack environment of 300,000 cores and is still growing rapidly at a rate of 30,000 CPU cores per month
    • Monitoring and Autoscaling Features for Self-Managed Kubernetes clusters at WalmartLabs
    • Secured edge infrastructure for a Contactless Payment System with StarlingX at China UnionPay
    • How to run a public cloud on OpenStack from China Mobile
    • Integrating RabbitMQ with OpenStack at LINE, the most popular messaging app in Japan
  • Project updates and onboarding from OSF projects: Airship, Kata Containers, OpenStack, StarlingX, and Zuul
  • Collaborative sessions at the Forum, where open infrastructure operators and upstream developers will gather to jointly chart the future of open source infrastructure, discussing topics ranging from upgrades to networking models and how to get started contributing
  • Hands-on training around open source technologies, delivered directly by the developers and operators building the software
  • The Summit will be followed by the Project Teams Gathering (PTG), where open source contributor teams and working groups will meet to get work done, with a special focus this PTG on onboarding new team members

Now it’s time to register yourself and your team for the Shanghai Summit before prices increase next week, on August 14 at 11:59pm PT (August 15 at 2:59pm China Standard Time). If your organization is recruiting new talent or wants to share news around a new product launch, join the Summit as a sponsor by reaching out to summit@openstack.org.

OpenStack Foundation:

Open Infrastructure Summit:

  • The Community Contributor Award nominations are open until October 20th at 7:00 UTC. Community members from any Foundation project from Airship, Kata Containers, OpenStack, StarlingX and Zuul can be nominated! Recipients will be announced in Shanghai at the Summit.
  • Registration is open. Summit tickets grant you access to the PTG. Save on tickets by purchasing them now at the early bird price. There are 2 ways to register – in USD or in RMB (with fapiao)
  • Know an organization that’s innovating with open infrastructure? Nominate them for the Superuser Awards by September 27.
  • Need a Chinese Visa? Start the process now! Information here.
  • Have your brand in the spotlight by sponsoring the Summit! Learn more here.
  • The Travel Support Program is also available. Apply before August 13!

Project Teams Gathering:

  • PTG attendance surveys have been sent out to project/group/team leads and responses are due August 11. If you are a team lead and missed the email with the survey, please contact Kendall Nelson (knelson@openstack.org) ASAP.
  • Registration is open. PTG tickets are included with Summit registration. Save on tickets by purchasing them now at the early bird price. There are 2 ways to register – in USD or in RMB (with fapiao)
  • The Travel Support Program is also available. Apply before August 13!

Airship: Elevate Your Infrastructure

  • Directly following the Technical Committee election, the Airship project is holding its first Working Committee election. The Working Committee is intended to help influence the project strategy, arbitrate disagreements between Core Reviewers within a single project or between Airship projects, define the project’s core principles, handle marketing and communications, and provide product management as well as ecosystem support. The close of the Working Committee polling will mark the full transition of Airship to a community-governed open source project with 100% elected leadership.

Kata Containers: The speed of containers, the security of VMs

  • Kata Containers 1.8 release landed on July 24. This latest release upgrades the QEMU hypervisor from a QEMU-lite base to upstream QEMU 4.0. Kata templating code is updated to make use of the upstream x-ignored-shared. Firecracker hypervisor is also updated to 0.17, and Kata now has support for using Firecracker’s jailer, adding extra security isolation for the VMM on the host. Fixes and usability improvements for virtio-fs have also been introduced.
  • The Kata Containers 1.9 Alpha release was also created. In the upcoming 1.9 release, which is expected to land in mid-October, Kata will introduce support for a new hypervisor: ACRN. View the latest Kata Containers releases here.
  • The Kata community is excited to again have a significant presence at the upcoming Open Infrastructure Summit with 5 talks accepted. Check out the full line-up of Kata sessions here.

OpenStack: Open Source Software for Creating Private and Public Clouds

  • The OpenStack User Committee (UC) is tasked with representing OpenStack users in the project governance. Two UC seats will soon be renewed. The nomination period is currently underway.
  • The next OpenStack release (planned for October 16) is called Train. But what should be the name of the release after that? Our release naming process calls for a name starting with the letter U, ideally related to a geographic feature close to Shanghai, China. The community proposed several options, and a community poll will soon be opened. Watch out for it!
  • Each release cycle, we define common goals for the OpenStack project teams. The goal selection process for the ‘U’ release has started: please read Ghanshyam Mann’s openstack-discuss email if you want to make suggestions.
  • A security vulnerability in Nova Compute has been announced for all current versions, so anyone running it should make sure their deployment is updated with the corresponding release’s fix as soon as possible.

StarlingX: A Fully Featured Cloud for the Distributed Edge

  • See the list of StarlingX sessions at the upcoming Open Infrastructure Summit in Shanghai here!
  • In preparation for the 2.0 release, the community cut RC1 on a new branch this week. Testing of the stable codebase is still ongoing to ensure high code quality when the release comes out at the end of August.

Zuul: Stop Merging Broken Code

Find OSF at these Open Infrastructure Community Events

August

September

October

November

Questions / feedback / contribute

This newsletter is written and edited by the OpenStack Foundation staff to highlight open infrastructure communities. We want to hear from you! If you have feedback, news or stories that you want to share, reach us through community@openstack.org . To receive the newsletter, sign up here.

The post Inside open infrastructure: The latest from the OpenStack Foundation appeared first on Superuser.

by Allison Price at August 12, 2019 02:00 PM

August 09, 2019

Chris Dent

Placement Update 19-31

Pupdate 19-31. No bromides today.

Most Important

Same as last week: The main things on the Placement radar are implementing Consumer Types and cleanups, performance analysis, and documentation related to nested resource providers.

We need to decide how much of a priority consumer types support is. I've taken the task of asking around with the various interested parties.

What's Changed

  • A more complex nested topology is now being used in the nested-perfload check job, and both that and the non-nested perfload job run Apache Benchmark (ab) at the end. When you make changes you can look at the results of the placement-perfload and placement-nested-perfload gate jobs to see whether there has been a performance impact. Keep in mind the numbers are only a guide: the performance characteristics of VMs from different CI providers vary wildly.

  • A stack of several performance related improvements has merged, with still more to come. I've written a separate Placement Performance Analysis that summarizes some of the changes. Many of these may be useful for other services. Each iteration reveals another opportunity.

  • In some environments placement will receive a URL of '' when '/' is expected. Auth handling for version control needs to account for this.

  • osc-placement 1.6.0 is in the process of being released.

Stories/Bugs

(Numbers in () are the change since the last pupdate.)

There are 22 (-1) stories in the placement group. 0 (0) are untagged. 3 (0) are bugs. 4 (-1) are cleanups. 11 (0) are rfes. 4 (0) are docs.

If you're interested in helping out with placement, those stories are good places to look.

osc-placement

osc-placement is currently behind by 12 microversions.

  • https://review.opendev.org/666542 Add support for multiple member_of. There's been some useful discussion about how to achieve this, and a consensus has emerged on how to get the best results.

  • https://review.opendev.org/640898 Adds a new '--amend' option which can update resource provider inventory without requiring the user to pass a full replacement for inventory. This has been broken up into three patches to help with review.

Main Themes

Consumer Types

Adding a type to consumers will allow them to be grouped for various purposes, including quota accounting.

Cleanup

Cleanup is an overarching theme related to improving documentation, performance and the maintainability of the code. The changes we are making this cycle are fairly complex to write and will be fairly complex to use, so it is good that we're going to have plenty of time to clean up and clarify all these things.

As said above, there's lots of performance work in progress. We'll need to make a similar effort with regard to docs.

One outcome of this work will be something like a Deployment Considerations document to help people choose how to tweak their placement deployment to match their needs. The simple answer is use more web servers and more database servers, but that's often very wasteful.

Other Placement

Miscellaneous changes can be found in the usual place.

There are two os-traits changes being discussed. And zero os-resource-classes changes.

Other Service Users

New discoveries are added to the end. Merged stuff is removed. Anything that has had no activity in 4 weeks has been removed.

End

Somewhere in this performance work is a lesson for life: Every time I think we've reached the bottom of the "easy stuff", I find yet another bit of easy stuff.

by Chris Dent at August 09, 2019 02:07 PM

Galera Cluster by Codership

Setting Up a Galera Cluster on Amazon AWS EC2

Through Amazon Web Services (AWS), you can create virtual servers (i.e., instances). You can install database and Galera software on them. In this article, we’ll create three nodes, the minimum recommended for a healthy cluster, and configure them to use Galera Cluster.

Incidentally, there is a more detailed version of this article in the Tutorial section of our Library.

Assumptions & Preparation

We’re assuming you have an AWS account and know the basics of the EC2 (Elastic Compute Cloud) platform.

To access the nodes, you’ll need an encryption key. Create a new one specifically for Galera, using a tool such as ssh-keygen. Add that key to AWS, under Key Pairs.

Creating AWS Instances

To start creating instances in AWS, click on Instances, then Launch Instances. First, choose the operating system distribution. We chose here “CentOS 7 (x86_64) – with Updates HVM”.

Next, choose an instance type. Because we’re using this cluster as a training tool, we chose t2.micro, which is free for a year.

Next is the instance details. In the first box, for the number of instances, enter 3. You can leave everything else at the default values.

Adding storage is next. If you chose the free tier, the default is 8 GB. For training, this is plenty. You can click past the screen on Adding Tags.

Next is Security Group (i.e., AWS’s firewall). Create a new one for Galera and add an SSH rule to allow you to log in. For the source, choose My IP.

With that done, click on Review and Launch to see the choices you made. If everything is fine, click Launch.

A message will ask for an encryption key. Click Choose an Existing Key Pair and select the Galera one. Read and accept the warning and then click Launch Instance.

When all three nodes are running, label them (e.g., galera1). Check each instance to get their external IP addresses.

Installing Software on Nodes

You’re now ready to install the database and Galera software. Use ssh to log into each node through their external IP addresses, using your encryption key.

Install rsync, which Galera uses to synchronize new nodes, and firewalld on each node with a package-management utility like yum:

sudo yum -y install rsync firewalld

The database is next. You might install MySQL or MariaDB, depending on your preferences. Both work well with Galera Cluster. There are several methods by which you may install the database and Galera software. For instructions on this, go to our documentation page on Installing Galera Cluster.

Configuring the Nodes

You’ll need to edit the database configuration file (i.e., /etc/my.cnf.d/server.cnf) on each node. There are some parameters related to MySQL or MariaDB and the InnoDB storage engine that you might want to add for better performance and troubleshooting. See the Tutorial for these. As for Galera, add a [galera] section to the configuration file:

[galera]
wsrep_on=ON
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so

wsrep_node_name='galera1'
wsrep_node_address="172.31.19.208"

wsrep_cluster_name='galera-training'
wsrep_cluster_address="gcomm://172.31.19.208,172.31.26.197,172.31.15.54"

wsrep_provider_options="gcache.size=300M; gcache.page_size=300M"
wsrep_slave_threads=4
wsrep_sst_method=rsync

The wsrep_on parameter enables Galera. The file path for wsrep_provider may have to be adjusted for your server.

The wsrep_node_name needs to be unique for each node. The wsrep_node_address is the IP address for the node. For AWS, use the internal ones.

The wsrep_cluster_name is the cluster’s name. The wsrep_cluster_address contains the addresses of all nodes.

Security Settings

You now have to open certain ports. Galera Cluster uses four TCP ports: 3306 (MySQL’s default), 4444, 4567, and 4568. It also uses one UDP port: 4567. For SELinux, open these ports by executing the following on each node:

semanage port -a -t mysqld_port_t -p tcp 3306
semanage port -a -t mysqld_port_t -p tcp 4444
semanage port -a -t mysqld_port_t -p tcp 4567
semanage port -a -t mysqld_port_t -p udp 4567
semanage port -a -t mysqld_port_t -p tcp 4568
semanage permissive -a mysqld_t

You’ll have to do the same for the firewall:

systemctl enable firewalld
systemctl start firewalld

firewall-cmd --zone=public --add-service=mysql --permanent
firewall-cmd --zone=public --add-port=3306/tcp --permanent
firewall-cmd --zone=public --add-port=4444/tcp --permanent
firewall-cmd --zone=public --add-port=4567/tcp --permanent
firewall-cmd --zone=public --add-port=4567/udp --permanent
firewall-cmd --zone=public --add-port=4568/tcp --permanent

firewall-cmd --reload

Now you need to add some related entries to AWS. Click Security Groups and select the Galera group. Under the Actions, select Edit Inbound Rules.

Click Add Rule, select the MySQL/Aurora type and enter the internal IP address for the first node (e.g., 172.31.19.208/32). Next, add another rule, but this time a Custom TCP Rule for port 4444, using the same internal address. Now add another custom TCP entry, but for the port enter “4567 – 4568”. Last, add a custom UDP entry for port 4567.

Repeat these four entries for each node, adjusting the IP addresses. When finished, click Save.

Starting Galera

When starting a new cluster, you tell the first node that it’s first by using the --wsrep-new-cluster option with mysqld. To make it easy, if you’re using MariaDB 10.4 with version 4 of Galera, you can use the galera_new_cluster script. Execute it only on the first node. This will start MySQL and Galera on that one node. On the other nodes, execute the following:

systemctl start mysql

Once MySQL has started on each, enter the line below from the command-line on one of the nodes. There’s no password yet, so just hit Enter.

mysql -p -u root -e "SHOW STATUS LIKE 'wsrep_cluster_size'"

+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+

You can see here that there are three nodes in the cluster. That’s what we want. Galera Cluster was successfully installed on AWS.

by Sakari Keskitalo at August 09, 2019 06:50 AM

August 08, 2019

Chris Dent

Placement Performance Analysis

Performance has always been important to Placement. In a busy OpenStack cloud, it will receive many hits per second. Any slowness in the placement service will add to the latency present in instance creation and migration operations.

When we added support for requesting complex topologies of nested resource providers, performance took an expected hit. All along, the plan was to make it work and then make it fast. In the last few weeks members of the Placement team have been working to improve performance.

Human analysis of the code can sometimes suggest obvious areas for performance improvement but it is also very easy to be misled. It's better to use profiling and benchmarking to get accurate measurements of what code is using the most CPU and to effectively compare different revisions of the code.

I've written two other postings about how to profile WSGI apps and analyse the results. Using those strategies we've iterated through a series of changes using the following process:

  1. profile to find the most expensive chunk of code
  2. determine if it can be improved and how
  3. change the code
  4. benchmark to see if it really helps, if it does, keep it, otherwise try something else
  5. repeat
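
As a concrete flavor of step 1, a per-request profiler can be wrapped around a WSGI application with a few lines of middleware. This is just a minimal sketch using the standard library, not the exact tooling described in those postings:

# Minimal per-request profiling middleware for a WSGI app (illustrative
# sketch using only the standard library). Note it profiles the call that
# builds the response, not the iteration of the response body.
import cProfile
import io
import pstats


class ProfilerMiddleware:
    def __init__(self, app, top=20):
        self.app = app
        self.top = top

    def __call__(self, environ, start_response):
        profiler = cProfile.Profile()
        result = profiler.runcall(self.app, environ, start_response)
        stream = io.StringIO()
        stats = pstats.Stats(profiler, stream=stream)
        stats.sort_stats("cumulative").print_stats(self.top)
        print(stream.getvalue())  # or log/write to a per-request file
        return result


# application = ProfilerMiddleware(application)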

The most recent big feature added to placement was called same_subtree. It adds support for requiring that a subset of the solution set for a request be under the same ancestor resource provider. This helps to support "affinity" within a compute host (e.g., "this FPGA is under the same NUMA node as this FPGA").

What follows are some comparison numbers from benchmarks run with the commit that added same_subtree and with recent master (between which several performance tweaks have been added). The test host is a Linux VM with 16 GB of RAM and 16 VCPUs. Placement is running standalone (without keystone), using PostgreSQL as its database and uwsgi as the web server with the following startup:

uwsgi --http :8000 --wsgi-file .tox/py37/bin/placement-api --processes 4 --threads 10

all on that same host.

Apache Benchmark (ab) is run on an otherwise idle 8-core machine on the same local network. Headers are set with -H 'x-auth-token: admin' and -H 'openstack-api-version: placement latest' to drive the appropriate noauth2 and microversion settings.

The server is preloaded with 7000 resource providers created using the nested-perfload topology.

The URL requested is:

GET /allocation_candidates?
     resources=DISK_GB:10&
     required=COMPUTE_VOLUME_MULTI_ATTACH&
     resources_COMPUTE=VCPU:1,MEMORY_MB:256&
     required_COMPUTE=CUSTOM_FOO&
     resources_FPGA=FPGA:1&
     group_policy=none&
     same_subtree=_COMPUTE,_FPGA

The Older Code

ab -c 1 -n 10 [the rest] (1 concurrency, 10 total requests):

Requests per second:    0.40 [#/sec] (mean)
Time per request:       2472.930 [ms] (mean)

ab -c 40 -n 400 [the rest] (40 concurrency, 400 total requests):

Requests per second:    1.46 [#/sec] (mean)
Time per request:       27454.696 [ms] (mean)

(For concerned benchmark purists: throughout this process I've also been running with thousands of requests instead of tens or hundreds to make sure that the mean values I'm getting here aren't because of the short run time. They are not. Also, not reported here, but I've also been doing benchmarks to compare how concurrent I can get before something explodes. As you might expect: as individual requests become lighter, the wider we can get.)

The New and Improved Code

(These numbers are not quite up to date. They are from a recent master but there are at least four more performance-related patches yet to merge. I'll update when that's all in.)

ab -c 1 -n 10 [the rest] (1 concurrency, 10 total requests):

Requests per second:    0.70 [#/sec] (mean)
Time per request:       1423.695 [ms] (mean)

ab -c 40 -n 400 [the rest] (40 concurrency, 400 total requests):

Requests per second:    2.90 [#/sec] (mean)
Time per request:       13772.054 [ms] (mean)

How'd We Get There?

This is a nice improvement. It may not seem like that much — over 1 second per request is rather slow in the absolute — but there is a lot happening in the background and a lot of data being returned.

One response is a complex nested JSON object of 2583330 bytes. It has 154006 lines when sent through json_pp.

There are several classes of changes that were made. These might be applicable to other environments (like yours!):

  • If using SQLAlchemy, using the RowProxy object directly, within the persistence layer, is okay and much faster than casting to a dict or namedtuple (which have interfaces the RowProxy already provides).

  • Use __slots__ in frequently used objects. It really does speed up attribute access time.

  • Profiling can often reveal sets of data that are retrieved multiple times. If you can find these and build them incrementally in the context of a single request/operation it can be a big win. See Add RequestWideSearchContext.summaries_by_id and Track usage info on RequestWideSearchContext for examples.

  • If you're doing anything with membership checking with a list and you're able to make it a set, do.

  • When using SQLAlchemy's in_ operator with a large number of values, an expanding bindparam can make a big difference in performance.

  • Implementing __copy__ on simple classes of objects that are copied many times in a single request helps; Python's naive copy is expensive, in aggregate. A small sketch of this and the __slots__ and set tips follows this list.
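
The sketch below illustrates three of the micro-optimisations listed above; the class is a stand-in invented for demonstration, not Placement's actual code:

# Illustrative sketch: __slots__ for cheaper attribute storage/access,
# a hand-written __copy__ to avoid Python's generic copy machinery, and
# set membership instead of list membership for repeated checks.
import copy


class ProviderSummary:
    __slots__ = ("uuid", "resources", "traits")

    def __init__(self, uuid, resources, traits):
        self.uuid = uuid
        self.resources = resources
        self.traits = traits

    def __copy__(self):
        # Construct the clone directly rather than letting copy.copy()
        # reconstruct state generically, which is slower in aggregate.
        return ProviderSummary(self.uuid, dict(self.resources), set(self.traits))


summary = ProviderSummary("rp-1", {"VCPU": 8}, {"HW_CPU_X86_AVX2"})
clone = copy.copy(summary)  # dispatches to __copy__

# Repeated membership checks are much cheaper against a set than a list.
wanted = {"rp-1", "rp-7", "rp-42"}
assert summary.uuid in wanted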

Also, not based on the recent profiling, but in earlier work comparing non-nested setups (we've gone from 1.2 seconds for a GET /allocation_candidates?resources=DISK_GB:10,VCPU:1,MEMORY_MB:256 request against 1000 providers in early January to .53 seconds now) we learned the following:

  • Unless you absolutely must (perhaps because you are doing RPC), avoid using oslo versioned objects. They add a lot of overhead for type checking and coercing when getting and setting attributes.

What's Next?

I'm pretty sure there are a lot more improvements to be made. Each pass through the steps listed above exposes another avenue for investigation. Thus far we've been able to make improvements without too much awareness of the incoming request: we've not been adding conditionals or special-cases. Adding those will probably take us into a new world of opportunities.

Most of the application time is spent interacting with the database. Little has yet been done to explore tweaking the schema (things like de-normalization) or tweaking the database configuration (threads available, cache sizes, using SSDs). All of that will have impact.

And, in the end, because Placement is a simple web application over a database, the easiest way to get more performance is to add more web and database servers and load balance them. However, that's a cop-out; we should save cycles where we can. Everything is expensive at scale.

by Chris Dent at August 08, 2019 04:30 PM

Aptira

Designing & Building a Network Functions Virtualisation Infrastructure (NFVi) Orchestration Layer using Cloudify

One of our customers was building a greenfield Network Functions Virtualisation Infrastructure (NFVi) and required orchestration capabilities, but lacked the skills to do this themselves. Designing an ideal deployment model for the orchestration system using Cloudify is a major challenge, but this type of challenge is what the Aptira engineers relish.


The Challenge

This greenfield NFVi platform consists of a private Cloud with a high-fidelity full-stack configuration that includes a Cloud platform / Virtualised Infrastructure Manager (VIM), Orchestration, Software Defined Networking (SDN), and solution-wide alarming and monitoring spread across multiple data centres across multiple regions.

Their internal team did not have a deep skill base in the area of Network Functions Virtualisation Orchestration (NFVO) and so turned to Aptira to augment their core team with these skills.

In this engagement Aptira was responsible for designing and building the orchestration layer of the NFVi platform using Cloudify. The platform had to meet world-class enterprise telco standards, which presented multiple design challenges:

  • National service level scalability
  • High availability across geo-distributed NFV systems
  • Lack of concrete use case (since it was still early NFV days), and
  • The myriad technical and operational requirements associated with such a large-scale platform

The Aptira Solution

There were many stated requirements for the NFVi platform, but two would determine the success or failure of the design: scalability and performance. Both are critical when building large-scale, geo-distributed NFV systems.

Aptira’s analysis of the customer requirements zeroed in on one key factor: the number and distribution of VNFs deployed and managed in the platform, combined with the frequency of configuration change. New VNFs or changed orchestration models further increase the demand on the orchestration function.

Orchestration was implemented in the customer’s NFVi platform using Cloudify to manage VNF lifecycles. Growth in the number of VNF deployments can impact scalability and performance in a non-linear manner, so these factors shaped the design of the deployment architecture of the orchestration layer.

The key design considerations for the deployment architecture include (but are not limited to) the following:

  • Number of VNFs to be managed
  • Operational design of VNFs
  • Number of NFVi PoPs across which VNFs are to be orchestrated
  • Latency/delay between Cloudify and the VNFs
  • Number of technology domains across which Cloudify has to orchestrate
  • Envisioned roadmap of the expansion of NFVi deployments

Factoring all these design elements into our analysis, Aptira designed two deployment options for consideration by our customer:

  • Flat model: in which only one instance of Cloudify will be deployed. This Cloudify instance manages VNF instances and orchestration across different NFVi Domains/PoPs as shown in figure 1.
  • Hierarchical model: in which Cloudify is deployed in each of the NFVi domains managing VNFs and orchestrating resources across domain specific NFVi-PoPs. And then a Global orchestrator to handle the orchestration across multiple NFVi/technology domains as shown in figure 2.

Each model has its pros and cons:

Whilst the Flat model is simple to deploy and able to handle most orchestration-related transactions, it suffers when transactions must be handled across multiple data centres, introducing a dependency on WAN latency.

The Hierarchical model requires careful consideration of resource allocation and placement across failure domains, but has significant advantages when handling operational aspects of VNFs such as Closed Loop Automation Policy (CLAMP). Localising such actions increases the uptime of VNFs.

Aptira presented two options mainly due to the absence of defined tenant workloads and use cases. Our intent was to demonstrate to the customer the full range of possibilities and to work with the customer on how to choose the appropriate deployment model depending on the emerging tenant requirements.


The Result

Aptira were able to validate both deployment models with the customer using a real telco use case, and also prepared a design paper for the solution architects working on the entire NFVi solution. This allowed the customer to plan their deployments and talk to their tenants about the use cases that could be realised with such a model.


Keep your data in safe hands.
See what we can do to protect and scale your data.

Secure Your Data

The post Designing & Building a Network Functions Virtualisation Infrastructure (NFVi) Orchestration Layer using Cloudify appeared first on Aptira.

by Aptira at August 08, 2019 01:30 PM

August 07, 2019

Aptira

Swinburne Nextcloud Storage

Aptira Swinburne Nextcloud Case Study

Aptira previously built a large Ceph storage cluster for Swinburne. While the Ceph storage has been reliable, Swinburne wanted to offer a Dropbox-like user experience for staff and students on top of this on-premises storage.


The Challenge

Swinburne wanted to improve access to the Ceph storage in a number of ways:

  • Improve ease-of-use and features for users: the standard storage protocols offered by Ceph are not readily accessible by less technical users. Swinburne wanted to make storage available to a broader cohort of users by adding a user-friendly interface with sharing and collaboration features.
  • Reduce the maintenance required of their IT services department: at the time, Swinburne was using a variety of methods to provision access to storage, all of them requiring manual steps before the storage could be delivered to a user. Keeping track of the current storage allocations had also become a burden for staff.
  • Integrate authentication with their existing Azure AD system: Allow users to login via SSO.
  • Integrate storage account requests into their existing ITSM system to enable self-service provisioning for users.

Swinburne had identified a few candidate products that might fulfill their requirements, but had not looked at each in any great depth due to internal resourcing constraints.


The Aptira Solution

Aptira first undertook an evaluation of four candidate storage applications. We rapidly deployed each application in an environment within Swinburne so the features and functionality of each could be compared. We produced a detailed evaluation report that allowed Swinburne to make an informed decision about which application to move forward with. Two leading candidates were put forward by Aptira and those deployments were converted into a larger-scale proof-of-concept that included integration with the actual Ceph storage so Swinburne staff and IT services team could get a feel for using each application.

The Nextcloud application was eventually chosen as it met the majority of their user and business requirements. From here Aptira developed a comprehensive solution architecture, paying particular attention to high availability and the ability to scale as the user base increased.

According to the solution architecture, Aptira deployed:

  • A MariaDB Galera cluster
  • A Kubernetes cluster to host the Nextcloud platform
  • Nextcloud, Redis and a MariaDB proxy as containers

Kubernetes was selected as the container orchestration platform due to its self-healing and scaling capabilities, and its ability to simplify application deployment and configuration. While the Nextcloud community provides a pre-built container image, it was not suitable for a multi-node production deployment, so we developed a custom image using the existing image as a base.

Maintainability was a significant concern for Swinburne so we ensured that all components of the architecture were deployed using Ansible to eliminate any manual steps in the deployment. We integrated our Ansible work into Swinburne’s existing Ansible Tower deployment, creating job templates so that deployments could be triggered from the Tower server. Since all of our work was being stored in Git on Swinburne’s GitLab server, we also created CICD pipelines to both build the Nextcloud container image and to trigger deployment to their test and production environments via Ansible Tower. During handover, Swinburne IT staff were able to deploy changes to the test environment by simply committing code to the repository.

Finally, we worked with ITSM staff to integrate the new service into Swinburne’s self-service portal, so users can request access to storage and make changes to their allocated quota.


The Result

Swinburne staff now have a stable and performant web-based storage service where they can upload, manage and share on-premises data.

As the uptake of the service increases, IT staff also have the confidence that the service can be scaled out to handle the increasing interest from users.

By recommending applications with an external API, Aptira made sure that Swinburne’s ITSM system would easily integrate with Nextcloud and satisfy Swinburne’s requirement to have a single pane of glass for all user service requests. With ITSM integration, Swinburne IT have also gained a charge-back capability to recover costs from other departments.

The solution was built with 100% open source components, reducing vendor lock-in.

While Aptira is happy to recommend and deploy greenfield DevOps infrastructure to support a company’s CICD needs, this project showed that we can also customise our solutions to fit in with our customers’ existing DevOps infrastructure, configuring a complete deployment pipeline for provisioning the entire solution.


OTHER SWINBURNE CASE STUDIES

  • Swinburne Case Study 1: We teamed up with SUSE to build a very high-performing and scalable storage landscape at a fraction of the cost of traditional storage systems.
  • Swinburne Case Study 2: Swinburne needed to set up a massive (think petabyte) storage system for their researchers to store valuable research data.
  • Swinburne Case Study 3: As SUSE Storage 5 was released, Swinburne wanted to take advantage of its new features, so we planned an upgrade to this latest version.
  • Swinburne Case Study 4: Swinburne wanted to offer a Dropbox-like user experience for staff and students on top of this on-premises storage.

Keep your data in safe hands.
See what we can do to protect and scale your data.

Secure Your Data

The post Swinburne Nextcloud Storage appeared first on Aptira.

by Aptira at August 07, 2019 01:41 PM

August 02, 2019

Chris Dent

Placement Update 19-30

Pupdate 19-30 is brought to you by the letter P for Performance.

Most Important

The main things on the Placement radar are implementing Consumer Types and cleanups, performance analysis, and documentation related to nested resource providers.

What's Changed

Stories/Bugs

(Numbers in () are the change since the last pupdate.)

There are 23 (2) stories in the placement group. 0 (0) are untagged. 3 (1) are bugs. 5 (0) are cleanups. 11 (1) are rfes. 4 (0) are docs.

If you're interested in helping out with placement, those stories are good places to look.

osc-placement

osc-placement is currently behind by 12 microversions.

  • https://review.opendev.org/666542 Add support for multiple member_of. There's been some useful discussion about how to achieve this, and a consensus has emerged on how to get the best results.

  • https://review.opendev.org/640898 Adds a new '--amend' option which can update resource provider inventory without requiring the user to pass a full replacement for inventory

Main Themes

Consumer Types

Adding a type to consumers will allow them to be grouped for various purposes, including quota accounting.

Cleanup

Cleanup is an overarching theme related to improving documentation, performance and the maintainability of the code. The changes we are making this cycle are fairly complex to write and will be fairly complex to use, so it is good that we're going to have plenty of time to clean up and clarify all these things.

I started some performance analysis this week. Initially I worked with placement master in a container, but as I began making changes I moved back to running container-less. What I discovered was that there is quite a bit of redundancy in the objects package that I was able to remove. For example, we were creating at least twice as many ProviderSummary objects as required in a situation with multiple request groups, and there would likely have been even more duplicates with more request groups. That's improved in this change, which is at the end of a stack of several other like-minded improvements.

The improvements in that stack will not be obvious until the more complex nested topology is generally available. My analysis was based on that topology.

Not to put too fine a point on it, but this kind of incremental analysis and improvement is something I think we (the we that is the community of OpenStack) should be doing far more often. It is incredibly revealing about how the system works and opportunities for making the code both work better and be easier to maintain.

One outcome of this work will be something like a Deployment Considerations document to help people choose how to tweak their placement deployment to match their needs. The simple answer is use more web servers and more database servers, but that's often very wasteful.

Other Placement

Miscellaneous changes can be found in the usual place.

There is one os-traits change being discussed, and two os-resource-classes changes.

Other Service Users

New discoveries are added to the end. Merged stuff is removed. Anything that has had no activity in 4 weeks has been removed.

End

I started working with around 20,000 providers this week. Only 980,000 to go.

by Chris Dent at August 02, 2019 12:35 PM

August 01, 2019

Thomas Goirand

My work during DebCamp / DebConf

Lots of uploads

Grepping my IRC log for the BTS bot output shows that I uploaded roughly 244 times in Curitiba.

Removing Python 2 from OpenStack by uploading OpenStack Stein in Sid

Most of these uploads were uploading OpenStack Stein from Experimental to Sid, with a breaking record of 96 uploads in a single day. As the work for Python 2 removal was done before the Buster release (uploads in Experimental), this effectively removed a lot of Python 2 support.

Removing Python 2 from Django packages

But once that was done, I started uploading some Django packages. Indeed, since Django 2.2 was uploaded to Sid with the removal of Python 2 support, a lot of dangling python-django-* packages needed to be fixed. Not only did Python 2 support need to be removed from them, but often patches were needed in order to fix at least the unit tests, since Django 2.2 removed a lot of things that had been deprecated for a few versions. I went through all of the Django packages we have in Debian, and I believe I fixed most of them. I uploaded Django packages 43 times, fixing 39 packages.

Removing Python 2 support from non-django or OpenStack packages

During the Python BoF at Curitiba, we collectively decided it was time to remove Python 2, and that we’ll try to do as much of that work as possible before Bullseye. Details of this will come from our dear leader p1otr, so I’ll let him write the document and won’t comment (yet) on how we’re going to proceed. Anyway, we already have a “python2-rm” release tracker. After the Python BoF, I also started removing Python 2 support from a few packages with more generic usage, hopefully touching only leaf packages without breaking things. I’m not sure of the total count of packages that I touched, probably a bit less than a dozen.

Horizon broken in Sid since the beginning of July

Unfortunately, Horizon, the OpenStack dashboard, is currently still broken in Debian Sid. Indeed, since Django 1.11, the login() function in views.py has been deprecated in favor of a LoginView class, and in Django 2.2 support for the function has been removed. As a consequence, since the 9th of July, when Django 2.2 was uploaded, Horizon’s openstack_auth/views.py is broken. Upstream says they are targeting Django 2.2 for next February. That’s way too late. Hopefully, someone will be able to fix this situation with me (it’s probably a bit too much for my Django skills). Once this is fixed, I’ll be able to work on all the Horizon plugins which are still in Experimental. Note that I already fixed all of Horizon’s reverse dependencies in Sid, but some of the patches need to be upstreamed.
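
For reference, the kind of change needed here is the standard Django migration from the removed login() view function to the class-based LoginView. A minimal sketch of that generic pattern (not Horizon’s actual patch) looks like this:

# Generic Django 2.2 migration from the removed login() function to LoginView.
# This is an illustrative sketch of the upstream Django pattern, not Horizon code.
from django.contrib.auth import views as auth_views
from django.urls import path

# Before (Django <= 1.11 era):
#   from django.contrib.auth.views import login
#   url(r'^login/$', login, {'template_name': 'auth/login.html'}, name='login')

urlpatterns = [
    path(
        "login/",
        auth_views.LoginView.as_view(template_name="auth/login.html"),
        name="login",
    ),
]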

Next work (from home): fixing piuparts

I’ve already written a first attempt at a patch for piuparts, so that it uses Python 3 and not Python 2 anymore. That patch is already up as a merge request on Salsa, though I haven’t had the time to test it yet. What remains to do is: actually test piuparts with this patch, and fix debian/control so that it switches to Python 3.

by Goirand Thomas at August 01, 2019 11:34 AM

July 31, 2019

Mirantis

Can we stop pretending everything is going to run in containers?

Containers are not the only technology out there -- and they never will be.

by Nick Chase at July 31, 2019 04:05 PM

Osones

Multi-AZ, remote backend, cinder-volume with OpenStack-Ansible

This article describes a common pattern we've been using at Osones and alter way for our customers deploying OpenStack with OpenStack-Ansible.

This pattern applies to the following context:

  • Multi-site (let's consider two) deployment, each site having its own (remote) block-storage storage solution (could be NetApp or similar, could be Ceph)
  • Each site will be an availability zone (AZ) in OpenStack, and in Cinder specifically
  • The control plane is spread across the two sites (typically: two controllers on one site, one controller on the other)

Cinder is the OpenStack Block Storage component. The cinder-volume process is the one interacting with the storage backend. With some drivers, such as LVM, the storage backend is local to the node where cinder-volume is running, but in the case of drivers such as NetApp or Ceph, cinder-volume will be talking to a remote storage system. These two different situations imply a different architecture: in the first case cinder-volume will be running on dedicated storage nodes, in the second case cinder-volume can perfectly run along other control-plane services (API services, etc.), typically on controller nodes.

An important feature of Cinder is the fact that it can expose multiple volume types to the user. A volume type translates the idea of different technologies, or at least different settings, different expectations (imagine: more or less performances, more or less replicas, etc.). A Cinder volume type matches a Cinder backend as defined in a cinder-volume configuration. A single cinder-volume can definitely manage multiple backends, and that especially makes sense for remote backends (as defined previously).

Now when one wants to make use of the Cinder availability zones feature, it's important to note that a single cinder-volume instance can only be dedicated to a single availability zone. In other words, you cannot have a single cinder-volume part of multiple availability zones.

So in our multi-site context, with each site having its own storage solution (considered remote to Cinder) and with cinder-volume running on the control plane, we'd be tempted to configure one cinder-volume with two backends. Unfortunately, due to the limitation mentioned earlier, this is not possible if we want to expose multiple availability zones. It is therefore required to have one cinder-volume per availability zone. This is in addition to having cinder-volume running on all the controller nodes (typically: three) for obvious HA reasons. So we would end up with two cinder-volume services (one per AZ) on each controller node; that would be six in total.

This is when OpenStack-Ansible and its default architecture comes in handy. OpenStack-Ansible runs most of the OpenStack (and some non-OpenStack as well) services inside LXC containers. When using remote backends, it makes sense to run cinder-volume in LXC containers, on control plane nodes. Luckily, it's as easy with OpenStack-Ansible to run one or many cinder-volume (or anything else, really) LXC containers per host (controller node), using the affinity option.

/etc/openstack_deploy/openstack_user_config.yml example to deploy two (LXC containers) cinder-volume per controller:

storage_hosts:
  controller-01:
    ip: 192.168.10.100
    affinity:
      cinder_volumes_container: 2
  controller-02:
    ip: 192.168.10.101
    affinity:
      cinder_volumes_container: 2
  controller-03:
    ip: 192.168.10.102
    affinity:
      cinder_volumes_container: 2

Then, thanks to the host_vars mechanism, it's also easy to push the specific availability zone configuration as well as the backend configuration to each cinder-volume. For example in the file /etc/openstack_deploy/host_vars/controller-01_cinder_volumes_container-fd0e1ad3.yml (name of the LXC container):

cinder_storage_availability_zone: AZ1
cinder_backends:
  # backend configuration
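
For illustration only, assuming a Ceph RBD backend, a filled-in cinder_backends section might look something like this (the backend name rbd_az1 is just an example; the options shown are standard Cinder RBD driver settings):

cinder_backends:
  rbd_az1:
    volume_driver: cinder.volume.drivers.rbd.RBDDriver
    volume_backend_name: rbd_az1
    rbd_pool: volumes
    rbd_ceph_conf: /etc/ceph/ceph.conf
    rbd_user: cinder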

You end up with each controller being able to manage both storage backends in the two sites, which is quite good from a cloud infrastructure HA perspective, while correctly exposing the availability zone information to the user.

by Adrien Cunin at July 31, 2019 03:00 PM

OpenStack Superuser

OpenStack Homebrew Club: Meet the sausage cloud

Like a lot of engineers, Nick Jones never met a piece of hardware that didn’t spark joy. Or at least enough interest to keep around collecting dust until the next idea sparked.

One night at the pub, the community engineering lead at Mesosphere and former colleague Matt Illingworth realized that if they combined parts, they could build a “very small but serviceable” cloud platform: “A little too much for ‘homelab’ meddling, but definitely enough to do something interesting,” Jones says.

Our Homebrew series highlights how OpenStack powers more than giant global data centers, showing how Stackers are using it at home. Here we’re stretching the definition a little: this deployment is tucked away in a former bunker. Not exactly a cluster in the closet, but decidedly in line with the hacker-hobbyist spirit.

Matt Illingworth checks the innards.

Now that the flame was lit, the pair ticked over options for where to put it. As luck would have it, a friend had plunked down for a decommissioned nuclear bunker tucked into the southern highlands of Scotland near Comrie. In one epic weekend, these brave hearts drove from Manchester to build a rack, install the hardware, deploy bootstrap and out-of-band infrastructure, configure basic networking and test it enough to manage with some confidence from remote. All in time to drive back 252 miles for their day jobs.

Figuring out what to call their creation kept them occupied from Glasgow to Lancaster. Illingworth wanted something that was universally liked (skirting the horror of both vegans and many tech-conference attendees) and settled on sausage. That decision, in turn, flavored the names of the virtual machines: chipolata, hotdog, saveloy, cumberland and bratwurst.

“Anyway, it seemed like a good idea after having been awake for about 18 hours,” Jones says.

Once back home, with just a few more days of work using Canonical’s MAAS and the Kolla project, they had a functioning OpenStack platform up and running — as a public cloud.

“Anyone who still thinks OpenStack is hard to deploy and manage is dead wrong,” Jones says. Some 14 months and two upgrades later (made painless with Kolla, Jones adds) it’s still afloat. But perhaps not for long: they’re looking for folks to chip in to pay for costs and who might be interested in using it, too. (You can get in touch with Jones through his website.)

Superuser asked Jones a few questions about the particulars.

Tell us more about the hardware.

It’s a hobby project so hopefully we won’t be shamed for the pitiful state of the hardware but it’s good enough to be of use!
Right now it’s running on a selection of vintage HP BL460c G6 blades – 10 of them in total at the minute, each with 192GB RAM and a pair of mirrored SSDs. This gives us a reasonable amount of density and serviceable I/O, although they’re power hungry since they’re a very old generation of Xeon. Currently on 1GbE networking but we’re hoping to switch that out to 10GbE soon.

What are you running on it?

In terms of services, aside from the ’standard’ OpenStack services, it also runs Magnum for Kubernetes clusters on demand and Designate for DNS-as-a-Service. The one service we don’t yet run is Cinder so there’s no persistent storage available, but as with the networking upgrade we’re hoping to add a small amount of that in the not-too-distant future, again probably on donated hardware. No object storage either. Given the hardware we’d probably deploy Ceph to take care of both those.

Who’s using it and what are they doing with it?

Most of the users of the platform have found it useful to be able to spin up a handful of pretty big (over 16GB ram) VMs in order to be able to do remote development work. It’s really handy for people who don’t want to run big Devstack or Minikube (for example) clusters on their laptops locally and who’d rather just SSH into somewhere else to do that sort of thing, but not worry about a really expensive bill which would be the case pretty much everywhere else. With enough of us who find such a service useful all clubbing together, it just about covers the costs of running it.

What’s next?

Longer-term plans are to put the configuration for the whole platform online and welcome pull requests to add or change configuration for various services – along with comprehensive testing, of course. This would probably appeal to a subset of OpenStack developers who’d like to test how their service runs on a public cloud.

More on the specifics of the deployment on his blog.

Got an OpenStack homebrew story? Get in touch: editorATopenstack.org

All photos courtesy Nick Jones.

The post OpenStack Homebrew Club: Meet the sausage cloud appeared first on Superuser.

by Nicole Martinelli at July 31, 2019 02:01 PM

Aptira

Comparison of Software Defined Networking (SDN) Controllers. Part 7: Comparison and Product Rating

This final part of our Software Defined Networking (SDN) Controller comparison series includes an in-depth evaluation and product rating for each of the most popular Open Source SDN controllers in industry and academia, including: the Open Network Operating System (ONOS), OpenDaylight (ODL), OpenKilda, Ryu and Faucet.
It is important to understand the motivations behind the available platforms. Each design has different use cases as usage depends not only on the capability matrix, but also on the cultural fit of the organisation and the project.

Architecture

As with most platforms, there are trade-offs to be considered when comparing a centralised, tightly coupled control plane to a decentralised, scalable and loosely coupled alternative SDN controller.

Centralised architectures such as ONOS and ODL tend to be easier to maintain and confer lower latency between the tightly coupled southbound API, PCE and northbound APIs. However, as scale increases, centralised controllers can become a bottleneck. In an SD-WAN context this can increase control plane latency, an issue that a distributed architecture can mitigate.

Distributed architectures such as OpenKilda and Faucet are generally more complex to maintain and deploy but can allow the platform to scale more effectively. By decoupling the processing of PCE, Telemetry and Southbound interface traffic, each function can be scaled independently to avoid performance bottlenecks. Additionally, specialised tools to handle big datasets, time series databases or path computation at scale become viable without adversely impacting southbound protocol performance.

Ryu is different from the other options: although it has a core set of programs that run as a ‘platform’, it is better thought of as a toolbox with which SDN controller functionality can be built.

Modularity and Extensibility

The modularity of each controller is governed by the design focus and programming languages. Platforms such as ONOS and ODL have built-in mechanisms for connecting code modules, at the expense of centralising processing to each controller. These two Java-based controllers take advantage of OSGi containers for loading bundles at runtime, allowing a very flexible approach to adding functionality.

Python based controllers such as Ryu provide a well-defined API for developers to change the way components are managed and configured.

Adding functionality to Faucet and OpenKilda is achieved through modifying the systems that make use of their northbound interfaces, such as the Apache Storm cluster or equivalent. This provides the added flexibility of using different tools and languages depending on the problem being solved. Additionally, increasing the complexity of northbound interactions does not negatively impact on the SDN directly.

Scalability

Of the options being considered, only ONOS and ODL contain internal functionality for maintaining a cluster. Each of these platforms is backed by a distributed datastore that shares the current SDN state and allows for controllers to failover in the event of a cluster partition. As new releases of each of the controllers emerge, this functionality looks to be evolving.

OpenKilda approaches cluster scalability in a modular way. While Floodlight is used as a southbound interface to the switch infrastructure, responsibility for PCE and telemetry processing is pushed northward into a completely separate Apache Storm based cluster. Each Floodlight instance is idempotent, with no requirement to share state. The Apache Storm cluster is by design horizontally scalable and allows throughput to be increased by adding nodes.

Both Ryu and Faucet contain no intrinsic clustering capability and require external tools such as Zookeeper to distribute a desired state. With both of these platforms, extra instances of the controller can be started independently as long as the backing configuration remains identical. PCE functionality for these controllers could be pushed down to the instance in the form of modules, or implemented in a similar manner to OpenKilda, backed by a processing cluster of choice.

As the scale of the SDN grows, it becomes untenable for a single localised cluster to handle the load from every switch on the network. Leaving aside geographic distribution of the controllers, breaking the network into smaller logical islands decreases the need for a single southward looking cluster to be massively scalable. With this design, coordination between the islands becomes critical and while a centralised view of the network is still required, the absence of PCE and telemetry processing should not affect data plane stability once flows are configured.

Ryu, Faucet, ODL and ONOS all look to scale in this way by including native BGP routing capabilities to coordinate traffic flows between the SDN islands. Universal PCE and telemetry processing will need to be developed for each of these cases, with OpenKilda providing a working reference architecture for achieving this. Given the current state of OpenKilda's documentation, its BGP support will need to be developed.

Interfaces

Considering future compatibility requirements for southbound control, ONOS, ODL and Ryu include protocols beyond just OpenFlow. P4, Netconf and OF-Config could enable additional switch hardware options moving forward should it be required.

The northbound API turns out to be one of the key differentiators between the platforms on offer. ONOS and ODL offer the largest set of northbound interfaces, with gRPC and RESTful APIs (among others) available, making them the easiest to integrate. Ryu and OpenKilda offer more limited RESTful APIs compared to ONOS and ODL. Faucet takes a completely different approach to applying changes, relying on configuration files to track intended system state instead of instantaneous API calls. This approach requires external tools for dynamically applying configuration but does open the SDN to administration by well-understood CI/CD pipelines and testing apparatus.

Telemetry

One of the primary problems with maintaining an SDN is extracting and using any available telemetry to infer system state and help remediate issues. On this front, ODL lacks functionality, with telemetry still being an experimental module in the latest upstream version. ONOS has modules available to allow telemetry to be used through Grafana or InfluxDB.

Faucet can export telemetry into Influxdb, Prometheus or flat text log files. While Prometheus saves data locally, it can also be federated, allowing centralised event aggregation and processing, while maintaining a local cache to handle upstream processing outages and maintenance.

OpenKilda uses Storm, which provides a computation system that can be used for real-time analytics. Storm passes the time-series data to OpenTSDB for storage and analysis. Neo4j, a graph analysis and visualisation platform, initially provided the PCE functionality.

Ryu doesn’t provide any telemetry functionality. This needs to be provided via external tools.

Resilience and Fault Tolerance

The ONOS and ODL platforms implement native clustering as part of their respective offerings. ONOS and ODL provide fault tolerance by running an odd number of SDN controllers. In the event of master node failure, a new leader is selected to take control of the network. The mechanism for choosing a leader is slightly different in these two controllers: ONOS focuses on eventual consistency while ODL focuses on high availability.

The remaining controllers (OpenKilda, Ryu and Faucet) have no inbuilt clustering mechanism, instead relying on external tools to maintain availability. This simplifies the architecture of the controllers and releases them from the overhead of maintaining distributed databases for state information. High availability is achieved by running multiple, identically configured instances, or a single instance controlled by an external framework that detects and restarts failed nodes.

For Ryu, fault tolerance can be provided by Zookeeper, which monitors the controllers to detect failures and shards state between cluster members. For Faucet in particular, which is designed to sit in a distributed, shared SDN and be controlled by static configuration files, restarting a controller is a quick, stable exercise that has no reliance on upstream infrastructure once the configuration is written.

Programming Language

ONOS, ODL and OpenKilda are written in Java, for which development resources are abundant in the market, with good supporting documentation and libraries available. While using Java should not be seen as a negative, Java processes can tend to be heavyweight and require resource and configuration management to keep them lean and responsive.

Ryu and Faucet are written in Python, a well-supported language with an active community developing the framework. The documentation is concise and technical, aimed at developers to maximise the utility of the system. Python is not a fast language and has inherent limitations due to both the dynamic type representations being used and limited multi-threaded capabilities (when compared with Java, Golang or C++).

Community

Both ODL and ONOS benefit from large developer and user communities under the Linux Foundation Networking banner. Many large international players are involved in the development and governance of these projects, which could add to the longevity and security over time. A possible downside is, as with any large project, there are many voices trying to be heard and stability can be impacted by feature velocity. This has occurred with similar projects such as OpenStack in the immediate past.

OpenKilda has a small but active community, which can limit the supportability, velocity and features of the platform. OpenKilda needs your support – chat with us to get involved.

Between these two extremes are Ryu and Faucet. Both are well-supported, targeted controllers. Due to the emerging nature of the field, both options look to have a bright future, with a simpler, streamlined approach to change submission and testing.

Evaluation Scoring Table

Based on the above criteria, we’ve scored each product against each weighted criterion. The results are below:

Criterion                              Weight   ONOS   ODL    OK     Ryu    Faucet
OpenFlow Support                         20.0   20.0   19.0   12.0   20.0   20.0
Northbound API support                   20.0   20.0   20.0   12.0   16.0    8.0
Southbound API support                   10.0   10.0   10.0    6.0    8.0    8.0
Programming Language                      5.0    4.0    4.0    4.5    4.5    4.5
Core Components features / services       5.0    4.5    4.5    3.5    2.0    3.5
Native Clustering Capabilities           10.0    9.0    7.0   10.0    2.0    5.0
Typical Architecture                      3.0    2.7    2.4    2.7    2.4    2.7
Horizontal Scalability                    5.0    3.5    3.0    4.5    1.0    4.0
Vertical Scalability                      5.0    3.5    3.0    5.0    4.5    0.5
Extensibility                             2.0    1.8    1.6    1.8    1.8    1.6
Community Size & Partnerships             5.0    4.5    4.5    1.0    4.5    3.5
Resilience and Fault Tolerance            5.0    4.0    3.0    4.5    4.0    4.5
Operations Support                        5.0    4.5    2.5    4.0    2.5    3.5
Weighted Score                          100.0   92.0   84.5   71.5   73.2   69.3
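For readers who want to check the arithmetic: each weighted score is simply the sum of that controller's per-criterion scores, since each criterion score is already scaled by its weight and the weights sum to 100, so the total reads directly as a percentage. A minimal sketch in Python reproducing the totals (the values are copied from the table above, not taken from any internal tooling):

# Per-criterion weighted scores copied from the table above (OK = OpenKilda).
scores = {
    "ONOS":   [20.0, 20.0, 10.0, 4.0, 4.5, 9.0, 2.7, 3.5, 3.5, 1.8, 4.5, 4.0, 4.5],
    "ODL":    [19.0, 20.0, 10.0, 4.0, 4.5, 7.0, 2.4, 3.0, 3.0, 1.6, 4.5, 3.0, 2.5],
    "OK":     [12.0, 12.0, 6.0, 4.5, 3.5, 10.0, 2.7, 4.5, 5.0, 1.8, 1.0, 4.5, 4.0],
    "Ryu":    [20.0, 16.0, 8.0, 4.5, 2.0, 2.0, 2.4, 1.0, 4.5, 1.8, 4.5, 4.0, 2.5],
    "Faucet": [20.0, 8.0, 8.0, 4.5, 3.5, 5.0, 2.7, 4.0, 0.5, 1.6, 3.5, 4.5, 3.5],
}

# The weights sum to 100, so the weighted score is a percentage.
for name, values in scores.items():
    print(name, round(sum(values), 1))
# ONOS 92.0, ODL 84.5, OK 71.5, Ryu 73.2, Faucet 69.3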

Product Rating

Based on our weighted criteria-based scoring, the evaluation ranks the products as per the below table:

Rank Product Score
1 ONOS  92.0%
2 ODL 84.5%
3 Ryu 73.2%
4 OK 71.5%
5 Faucet 69.3%

Conclusion

The effort spent investigating current Software Defined Networking (SDN) Controller platforms provides insight into the available Open Source SDN controllers, and should help users choose the controller that best matches their network design and requirements.


The post Comparison of Software Defined Networking (SDN) Controllers. Part 7: Comparison and Product Rating appeared first on Aptira.

by Farzaneh Pakzad at July 31, 2019 01:02 PM

Chris Dent

Profiling Placement in Docker

Back in March, I wrote Profiling WSGI Apps, describing one way to profile the placement service. It was useful enough that a version of it was added to the docs.

Since then I've wanted something a bit more flexible. I maintain a container for placement on Docker hub. When I want to profile recent master instead of code in a proposed branch, using a container can be tidier. Since this might be useful to others I thought I better write it down.

Get and Confirm the Image

Make sure you are on a host with docker and then get the latest version of placedock:

docker pull cdent/placedock

We can confirm the container is going to work with a quick run:

docker run -it --env OS_PLACEMENT_DATABASE__SYNC_ON_STARTUP=True \
               --env OS_API__AUTH_STRATEGY=noauth2 \
               -p 127.0.0.1:8080:80 \
               --rm --name placement cdent/placedock

In another terminal check it is working:

curl -s http://127.0.0.1:8080/ |json_pp

should result in something similar to:

{
   "versions" : [
      {
         "min_version" : "1.0",
         "links" : [
            {
               "href" : "",
               "rel" : "self"
            }
         ],
         "status" : "CURRENT",
         "max_version" : "1.36",
         "id" : "v1.0"
      }
   ]
}

ctrl-c in the terminal that is running the container. We don't want to use that one because we need one that will persist data properly.

Dockerenv for Convenience

To enable persistence, a dockerenv file will be used to establish configuration settings. The one I use, with comments:

# Turn on debug logging
OS_DEFAULT__DEBUG=True
# Don't use keystone for authentiation, instead pass
# 'x-auth-token: admin' headers
OS_API__AUTH_STRATEGY=noauth2
# Make sure the database has the right tables on startup
OS_PLACEMENT_DATABASE__SYNC_ON_STARTUP=True
# Connection to a remote database. The correct database URI depends on
# your environment.
OS_PLACEMENT_DATABASE__CONNECTION=postgresql+psycopg2://cdent@192.168.1.76/placement?client_encoding=utf8
# The directory where profile output should go. Leave this commented until
# sufficient data is present for a reasonable test.
# OS_WSGI_PROFILER=/profiler

Create your own dockerenv file, set the database URI accordingly (if you're not sure about what this might be, Quick Placement Development may be useful), and start the container back up. This time we will use the dockerenv file and put the container in the background:

docker run -idt -p 127.0.0.1:8080:80 \
       --env-file dockerenv \
       --rm --name placement \
       -v /tmp/profiler:/profiler \
       cdent/placedock

We've added a volume so that, eventually, profiler output can be saved to the disk of your container host, rather than inside the container itself. For the time being profiling is turned off because we don't want to slow things down while loading data into the system.

If you want to, confirm things are working again with:

curl -s http://127.0.0.1:8080/ |json_pp

Load Some Data

In some cases you don't need pre-existing data when profiling. If that's the case, you can skip this step. What you need to use to set up data may be very different from what I'm doing here.

Loading up the service with a bunch of data can be accomplished in various ways. I use a combination of shell scripts and gabbi. The shell script is responsible for the dynamic data while gabbi is responsible for the static structure. Some pending changes use the same system for doing some performance testing for placement. We will borrow that system. To speed things up a bit we'll use parallel.

seq 1 100 | \
   parallel "./gate/perfload-nested-loader.sh http://127.0.0.1:8080 gate/gabbits/nested-perfload.yaml"

Note: This will not work on a Mac if you are using the built-in uuidgen, which emits upper-case UUIDs. You may need to pipe its output through tr [A-Z] [a-z].

You can see how many providers you've created with a request like:

curl -s -H 'x-auth-token: admin' \
     -H 'openstack-api-version: placement latest' \
     http://127.0.0.1:8080/resource_providers | \
     json_pp| grep -c '"name"'

Once you have a sufficient number of resource providers and anything else you might like (such as allocations) in the system, you can start profiling.

Profiling

When we were loading data we had profiling turned off. Now we'd like to turn it on. Edit the dockerenv file to uncomment the OS_WSGI_PROFILER line, then docker kill placement and run the container again (using the same args as above). A restart will not work because we've changed the environment.

Make a query, such as:

curl http://127.0.0.1:8080/

and look in /tmp/profiler for the profile output; the filename should look something like this:

GET.root.19ms.1564579909.prof

If you have snakeviz installed you can inspect the profiling info from a browser (as described in the previous blog post):

snakeviz GET.root.19ms.1564579909.prof

That's not a very interesting request. It doesn't exercise much of the code nor access the database. If the system has been loaded with data as above, the following will query it:

curl -H 'x-auth-token: admin' \
     -H 'openstack-api-version: placement latest' \
"http://127.0.0.1:8080/allocation_candidates?\
resources=DISK_GB:10&\
required=COMPUTE_VOLUME_MULTI_ATTACH&\
resources_COMPUTE=VCPU:1,MEMORY_MB:256&\
required_COMPUTE=CUSTOM_FOO&\
resources_FPGA=FPGA:1&\
group_policy=none&\
same_subtree=_COMPUTE,_FPGA"

and then:

snakeviz GET.allocation_candidates.792ms.1564581384.prof

snakeviz sunburst

by Chris Dent at July 31, 2019 01:00 PM

StackHPC Team Blog

CloudKitty and Monasca: OpenStack charging without Telemetry

CloudKitty and Monasca project mascots

Tracking resource usage, and charging for it, is a requirement for many cloud deployments. Public clouds obviously need to bill their customers, but private clouds can also use chargeback and showback policies to encourage more efficient use of resources. In the OpenStack world, CloudKitty is the standard rating solution. It works by applying rating rules, which turn metric measurements into rated usage information.

For several years, gathering metrics in OpenStack has been implemented by two separate project teams: Telemetry and, more recently, Monasca. The future of Telemetry, which produces the Ceilometer software, is uncertain: historical contributors have stopped working on the project and its de-facto back end for measurements, Gnocchi, is also seeing low activity. Although Telemetry users have volunteered to maintain the project, the Monasca project appears to be healthier and more active.

Since deploying Monasca is our preferred choice to monitor OpenStack, we asked ourselves: can we use CloudKitty to charge for usage without deploying a full Telemetry software stack?

Ceilometer + Monasca = Ceilosca

Ceilometer is well integrated in OpenStack and can collect usage data from various OpenStack services, either by polling or listening for notifications. Ceilometer is designed to publish this data to the Gnocchi time series database for storage and querying.

In Monasca, metrics collected by the Monasca Agent focus more on monitoring the health and performance of the infrastructure and its services, rather than resource usage from end users (although it can gather instance metrics via the Libvirt plugin). Monasca stores these metrics in a time series database, with support for InfluxDB and Cassandra.

Despite this, we are not required to deploy and maintain Gnocchi just to collect usage data via Ceilometer: monasca-ceilometer, also known as Ceilosca, enables Ceilometer to publish data to the Monasca API for storage in its metrics database. Although Ceilosca currently lives in its own repository and must be installed by adding it to the Ceilometer source tree, there is an ongoing effort to integrate it directly into Ceilometer.

By default, Ceilosca will push several metrics based on instance detailed information, such as disk.root.size, memory, and vcpus, to Monasca under the service tenant. Each metric will be associated with a specific instance ID via the resource_id dimension. Metric dimensions also include user and project IDs. For example, to retrieve metrics associated with the p3 project, we can use the Monasca Python client:

monasca metric-list \
--tenant-id $(openstack project show service -c id -f value) \
--dimensions project_id=$(openstack project show p3 -c id -f value)

Once stored in Monasca, these metrics can be used by CloudKitty, thanks to the inclusion of a Monasca collector since the Queens release.

Let's see how we can apply a charge to the vcpus metric. We need to configure CloudKitty with the metrics.yml file to know about our metric:

metrics:
  vcpus:
    unit: vcpus
    groupby:
      - resource_id
    extra_args:
      resource_key: resource_id

Then, we configure the hashmap rating rules to apply a rate to CPU usage. We create a vcpus service and then create a mapping with a cost of 0.5 per CPU hour:

$ cloudkitty hashmap service create vcpus
+-------+--------------------------------------+
| Name  | Service ID                           |
+-------+--------------------------------------+
| vcpus | cb72cd89-43ef-46b9-b047-58e0b5335992 |
+-------+--------------------------------------+
$ cloudkitty hashmap mapping create 0.5 -s cb72cd89-43ef-46b9-b047-58e0b5335992 -t flat
+--------------------------------------+-------+------------+------+----------+--------------------------------------+----------+------------+
| Mapping ID                           | Value | Cost       | Type | Field ID | Service ID                           | Group ID | Project ID |
+--------------------------------------+-------+------------+------+----------+--------------------------------------+----------+------------+
| 68465dad-7c68-4f8e-a256-6a62735c1e3b | None  | 0.50000000 | flat | None     | cb72cd89-43ef-46b9-b047-58e0b5335992 | None     | None       |
+--------------------------------------+-------+------------+------+----------+--------------------------------------+----------+------------+

We then launch an instance. Once the instance becomes active, a notification is processed by Ceilometer and published to Monasca, recording that instance b7d926a8-cd63-4205-8f90-e3c610aeaad5 has 64 vCPUs.

$ monasca metric-statistics --tenant-id $(openstack project show service -c id -f value) vcpus avg "2019-07-30T14:00:00" --merge_metrics --group_by resource_id --period 1
+-------+---------------------------------------------------+----------------------+--------------+
| name  | dimensions                                        | timestamp            | avg          |
+-------+---------------------------------------------------+----------------------+--------------+
| vcpus | resource_id: b7d926a8-cd63-4205-8f90-e3c610aeaad5 | 2019-07-30T14:43:01Z |       64.000 |
+-------+---------------------------------------------------+----------------------+--------------+

With the default Kolla configuration, Nova also sends a report notification every hour, which is also stored in Monasca. Similarly, when an instance is terminated, a notification is published and converted into a final measurement in Monasca. However, using the default CloudKitty configuration, every instance measurement is interpreted as if the associated instance ran for the whole hour. For example, an instance launched at 10:45 and terminated at 11:15 would result in two whole hours being charged, instead of just 30 minutes. This can be mitigated by reducing the [collect]/period setting in cloudkitty.conf, for example down to one minute, and adjusting the charge rate to match the new period. For this approach to work, we need to have at least one measurement stored for each period. This isn't possible with audit notifications sent by Nova because one hour is the lowest possible period. An alternative is to rely on continuously updated metrics collected by Ceilometer, such as CPU utilisation. However, these kinds of Ceilometer metrics are unavailable in our bare metal environment.

Once CloudKitty has analysed usage metrics, we can extract rated data to CSV format. As can be seen below, two whole hours have been charged at the 0.5 per vCPU rate (32.0 for each hour), even though the instance had been launched around 14:45 and terminated around 15:20. We have compared using pure Ceilometer and Gnocchi instead of Ceilosca and Monasca and noticed the exact same issue.

$ cloudkitty dataframes get -f df-to-csv --format-config-file cloudkitty-csv.yml
Begin,End,Metric Type,Qty,Cost,Project ID,Resource ID,User ID
2019-07-30T14:00:00,2019-07-30T15:00:00,vcpus,64.0,32.0,35be5437552f40cba2aa6e5cb47df613,b7d926a8-cd63-4205-8f90-e3c610aeaad5,53ed408e5a7a4e79baa76803e1df61d6
2019-07-30T15:00:00,2019-07-30T16:00:00,vcpus,64.0,32.0,35be5437552f40cba2aa6e5cb47df613,b7d926a8-cd63-4205-8f90-e3c610aeaad5,53ed408e5a7a4e79baa76803e1df61d6
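To make the over-charging effect concrete, here is a rough sketch of the arithmetic, assuming the 0.5 per vCPU-hour hashmap rate and the 64-vCPU instance described above. This is only back-of-the-envelope maths to illustrate the behaviour, not CloudKitty's actual collector logic:

# Illustration only: hourly collect period vs. a one-minute period.
# Assumes the 0.5 per vCPU-hour rate and the 64-vCPU instance above.
rate_per_vcpu_hour = 0.5
vcpus = 64
runtime_minutes = 35  # launched ~14:45, terminated ~15:20

# With a one-hour period, every hour the instance touches is billed in full.
hours_touched = 2  # 14:00-15:00 and 15:00-16:00
print(hours_touched * vcpus * rate_per_vcpu_hour)  # 64.0 (the two 32.0 rows above)

# With a one-minute period (and a measurement stored per period), the charge
# tracks the actual runtime much more closely.
print(round(runtime_minutes * vcpus * rate_per_vcpu_hour / 60, 2))  # 18.67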

A downside of using Ceilosca instead of Ceilometer with Gnocchi is that metadata such as instance flavour is not available for CloudKitty to use for rating by default, at least in the Rocky release that we used. We will update this post if we can develop a configuration for Ceilosca that supports this feature.

OpenStack usage metrics without Ceilometer

Monasca has plans to capture OpenStack notifications and store them with the Monasca Events API, although this is not yet implemented. CloudKitty would require changes to support charging based on these events, since it is currently designed around metrics. It is worth pointing out that an ElasticSearch storage driver has just been proposed in CloudKitty, so these two new designs may line up in the future.

In the meantime, an alternative is to bypass Ceilometer completely and rely on another mechanism to publish metrics to Monasca. As mentioned earlier in this article, Monasca can provide instance metrics via the Libvirt plugin. However, this won't cover other services for which we may want to charge, such as volume usage.

Since the Monasca Agent can scrape metrics from Prometheus exporters, we are exploring whether we can leverage openstack-exporter to provide metrics to be rated by CloudKitty. Stay tuned for the next blog post on this topic!

by Pierre Riteau at July 31, 2019 08:50 AM

Colleen Murphy

How to get work done in open source

After working in open source for a while, I've been on both sides of the code submission dance: proposing a change, and reviewing a change.

Proposing a change to a project can feel harrowing: you don't know how the maintainers are going to respond to it, you don't know whether …

by Colleen Murphy at July 31, 2019 03:00 AM

July 30, 2019

OpenStack Superuser

Building a virtuous circle with open infrastructure: Inclusive, global, adaptable

Technology is constantly evolving. From couch surfing to surfing the web, creative problem solving either rides the wave in real time or gets towed under.

This is where open infrastructure is barreling ahead.

Take, for example, traffic. Great for websites – not so much for commutes. Just as a city expands roads and adds stoplights to meet the needs of a growing population, technology infrastructure demands the same type of thinking. Adaptability is key to keeping things moving.

These two words — open infrastructure — carry a lot of meaning. Let’s break them down in context.

Open refers to open-source components, meaning that the source code is available to anyone to use, study and share with others. Open source is becoming increasingly important to organizations worldwide. Why? Because they’re not locked into closed proprietary boxes, people can build on them and tailor them as needed, with the freedom and flexibility to innovate more effectively. It also allows them to find and create custom solutions faster than the market can provide.

But open source is only half of this equation. As Mark Collier, OpenStack Foundation COO says, “Open source is required, but it’s not enough.” It rests on a base of infrastructure.

Infrastructure is the backbone that supports hardware, software, networks, data centers, facilities and related equipment used to develop, test, operate, monitor, manage or support information technology services.

In its simplest form, Open Infrastructure is IT infrastructure built from open-source technologies, available for all users to work with, to improve and contribute back.

To ensure all these users benefit from open-source software, are able to engage with the community and chart the future course for its development, the Four Opens were set out in early 2010 as guiding principles.

  • Open source – all software developed must be done under an open source license. The software must be usable and scalable for all users anywhere in the world and cannot be feature or performance limited.
  • Open community – the OpenStack community is a level playing field, where anyone can rise or get elected to leadership positions. There is no reserved seat: contribution is the only valid currency.
  • Open development – all development processes are transparent and inclusive. Everyone is welcome to participate and suggestions are considered regardless of prior levels of contribution.
  • Open design – design is not done behind closed doors. It is done in the open, and includes as many people as possible.

These principles have been fundamental to making the OpenStack community what it is today. Currently, the community is drafting blog posts around the Four Opens chronicling the learnings and successes from this approach to share with the broader Open Infrastructure ecosystem. If you want to get involved, you can do so here.

“From the beginning, OpenStack knew there would be a need to interact and integrate with external projects,” says Allison Randal, a board member of OpenStack Foundation since 2012, in an Open Infra Summit lightning talk. That said, as the number of adjacent use cases expanded — from edge, CI/CD, containers and more — the landscape also shifted. The shift was towards offering broader solutions to organizations and projects, supporting their overall IT infrastructure needs.

As the projects grew in number and in size, it became clearer that the ongoing success of OpenStack and these projects were interdependent. This made the shift in focus even more natural and further aligned OpenStack with the goal of supporting and strengthening community.

To further encourage collaboration and boost inclusivity, the OpenStack Summit changed its name to the  Open Infrastructure Summit. The Summit provides common ground from all corners of the community — from 5G to hybrid cloud and dozens of adjacent open-source projects — and the name change reflects this. The goal is to make the summit more open and welcoming to all projects and invite everyone to learn alongside the people building and operating open infra.

Recognizing the shift in open infrastructure, the OSF is embracing it by supporting the communities helping to shape this movement. The OSF also recognizes that no single technology solution is going to support this transition and the integration and knowledge sharing around these open technologies is key to successful implementation.

As our world continues to evolve, open infrastructure will adapt and evolve with it.

Superuser is always interested in community content. Got something to say? Get in touch: editorATopenstack.org

The post Building a virtuous circle with open infrastructure: Inclusive, global, adaptable appeared first on Superuser.

by Ashleigh Gregory and Nicole Martinelli at July 30, 2019 02:04 PM

Aptira

Comparison of Software Defined Networking (SDN) Controllers. Part 6: Faucet

Comparison of Software Defined Networking (SDN) Controllers. Faucet

The final Open Source Software Defined Networking (SDN) Controller to be compared in this series is Faucet. Built on top of Ryu, Faucet is a lightweight SDN Controller adding a critical northbound function for operations teams.

Faucet is a compact open source OpenFlow controller which enables network operators to run their networks the same way they do server clusters. Faucet moves network control functions (like routing protocols, neighbor discovery and switching algorithms) out of traditional router or switch embedded firmware and into vendor-independent, server-based software, where those functions are easy to manage, test and extend with modern systems management best practices and tools.

Architecture

As shown in the figure below, architecturally, each Faucet instance has two connections to the underlying switches: one for control and configuration updates, while the other (Gauge) is a read-only connection dedicated to gathering, collating and transmitting state information for processing elsewhere, for example with InfluxDB or Prometheus.

Comparison of Software Defined Networking (SDN) Controllers. Faucet Diagram

Modularity and Extensibility

Python based controllers provide a well-defined API for developers to change the way components are managed and configured.

Adding functionality to Faucet is achieved through modifying the systems that make use of its Northbound interfaces. This provides the added flexibility of using different tools and languages depending on the problem being solved. Additionally, increasing the complexity of northbound interactions does not negatively impact the SDN directly.

Scalability

Faucet is designed to be deployed at scale such that each instance is close to the subset of switches under its control. Each instance of Faucet is self-contained and can be deployed directly to server hardware or through containers, moving the administration back into well understood areas of automation.

Due to the lightweight nature of the code and the smaller control space for each instance, no clustering is required – each instance is completely idempotent and concerns itself with only what it is configured to control.

Cluster Scalability

  • Faucet contains no intrinsic clustering capability and requires external tools such as Zookeeper to distribute state if this is desired. Extra instances of the controller can be started independently as long as the backing configuration remains identical.
  • PCE functionality for these controllers could be pushed down to the instance in the form of modules, or implemented in a similar manner to OpenKilda, backed by a processing cluster of choice.

Architectural Scalability

  • It does not yet support a cooperative cluster of controllers.

Interfaces

  • Southbound: OpenFlow is the southbound protocol for managing devices. On top of it, Faucet supports features such as VLANs, IPv4, IPv6, static and BGP routing, port mirroring, policy-based forwarding and ACL matching.
  • Northbound: YAML configuration files track the intended system state instead of instantaneous API calls, requiring external tools for dynamically applying configuration. However, it does open the SDN to administration by well-understood CI/CD pipelines and testing apparatus.

Telemetry

Faucet can export telemetry into Influxdb, Prometheus or flat text log files. While Prometheus saves data locally, it can also be federated, allowing centralised event aggregation and processing, while maintaining a local cache to handle upstream processing outages and maintenance.

Resilience and Fault Tolerance

Faucet has no inbuilt clustering mechanism, instead relying on external tools to maintain availability. High availability is achieved by running multiple, identically configured instances, or a single instance controlled by an external framework that detects and restarts failed nodes.

For Faucet in particular, which is designed to sit in a distributed, shared SDN and be controlled by static configuration files, restarting a controller is a quick, stable exercise that has no reliance on upstream infrastructure once the configuration is written.

Programming Language

Faucet is written in Python.

Community

Faucet has an active community developing the framework and it is well supported.

Conclusion

Faucet is configured via a YAML file, which makes it a suitable option for CI/CD and testing environments. Faucet uses Prometheus for telemetry processing, while other components such as a PCE need to be developed.
This is the last controller we will evaluate as part of this series. The next post will include a scored rating and detailed evaluation for each SDN controller.


The post Comparison of Software Defined Networking (SDN) Controllers. Part 6: Faucet appeared first on Aptira.

by Farzaneh Pakzad at July 30, 2019 01:32 PM

July 29, 2019

OpenStack Superuser

Beating the learning curve at OSCON

PORTLAND, Ore. — When I registered for this year’s Open Source Conference (OSCON), which was also my first, I selected four tutorials as a part of my ticket. Options ranged from making art with open-source libraries to database management. Many of the sessions looked like the winning card for buzzword bingo: Blockchain! machine learning! serverless! After some deliberation, I went with hands-on sessions about Rust, p5.js, building an AI assistant and constructing a programming language.

First day jitters

Leading up to the event, participants were asked to have all the prerequisites set up for the tutorials. We got daily reminders. Daily. For each tutorial. This was annoying to say the least, especially since many of the tutorials told you to clone some git repo and then they didn’t include a link to it in the daily email. There was no way to opt out when you had completed the requirements either.

Monday morning, I arrived bright and early at the venue courtesy of MAX. I’d expected the light rail system to be jammed with conference goers (think: the rib-crushing crowds on trains to FOSDEM) and was pleasantly surprised when it wasn’t. I walked right up to registration, typed in my email address, looked up my ticket and got my badge printed in a flash. (For people still intent on making QR codes good for something, you could log in that way, too.) You were offered a weekly MAX ticket along with your voucher for the conference t-shirt and conference book (it is O’Reilly, after all).

The first tutorial was great! I learned the basics of Rust via a fun lab that involved sword fighting. The instructor was deft at breaking up the material into smaller topics complete with examples before jumping into exercises in the lab covering the new material.

If only the second one had been a bit less dry. It sounded promising: build an AI assistant that you could interact with using the open-source project Rasa. The tutorial was essentially ‘teaching’ the AI assistant by listing thousands of example inputs into the config. The more examples you provide, the more accurate the response of the assistant. It was less engaging than the earlier tutorial, but the instructor was much newer to teaching than the previous one. With a few more repetitions, this could improve.

Tuesday, it was time for the next two hands-on sessions: Processing Foundation’s p5.js project and building a programming language. Despite all the preparatory nudges, it wasn’t clear to me that it was basically a refresher for a few of my college classes  (I had a visualization class that used processing and a C++ class where we built a natural language processor). That said, both instructors were very good. The first tutorial was similar to that of the previous day where some slides walked through particular aspects of the language and some examples before offering a wider view on a larger project to apply the knowledge. The afternoon was a little more continuous and sans slides.

General assembly

The next two days kicked off with keynotes before hitting a roster of presentations. As carefully staged as these performances are, you can’t control everything: Wednesday morning’s keynotes were interrupted by a fire alarm. OSCON organizers still managed to end on time with only a few small changes to the keynote schedules – most speakers had their time cut a few minutes across both days and one keynote got bumped to day two.

The content of the keynotes split between two main themes: the importance of community and how it adds value, stability and marketability to any open-source project, and how open source is part of the future for most businesses (hopefully not the entire business plan, but definitely playing a role). I found myself agreeing with many of the key messages and appreciating the general rallying cry to open source rather than to one specific project or foundation. A single project won’t solve all the industry’s problems just as a single foundation is not the best home for all projects.

I crammed my agenda with sessions on open-source community, governance models in open source and themes like how to be a good community member etc. Nothing was earth-shattering or exactly new, but it’s always a pleasure to see a lot of good speakers share their experiences and observations. I also really enjoyed that many of the things to aspire to/good community traits/best practices are already built into the OpenStack community. It made me appreciate the stability of our community and the efforts of those who’ve come before me.

Overall, I’d give the event an A- or B+. OSCON brought together a cool mix of open-source projects from many foundations on a relatively level playing field.

Next year, you’ll still find me haunting the halls — even if my talks don’t get picked again, ahem — and I hope next time OpenStack, Kata,  Zuul, Airship and StarlingX can have a larger presence there.

The post Beating the learning curve at OSCON appeared first on Superuser.

by Kendall Nelson at July 29, 2019 02:01 PM

Aptira

Comparison of Software Defined Networking (SDN) Controllers. Part 5: Ryu

Comparison of Software Defined Networking (SDN) Controllers. Ryu

Our Open Source Software Defined Networking (SDN) Controller comparison continues with Ryu. Ryu is a very different proposition to the other options being put forward. Although boasting a core set of programs that are run as a ‘platform’, Ryu is better thought of as a toolbox, with which SDN controller functionality can be built.

Ryu is a component-based software defined networking framework. It provides software components with well-defined APIs that make it easy for developers to create new network management and control applications. Ryu means “flow” in Japanese and is pronounced “ree-yooh”.

Architecture

A Ryu SDN controller is composed of these components:

Comparison of Software Defined Networking (SDN) Controllers. Ryu Diagram
  • Southbound interfaces allow communication of SDN switches and controllers
  • Its core supports a limited set of applications (e.g. topology discovery, learning switch) and libraries
  • External applications can deploy network policies to data planes via well-defined northbound APIs such as REST

Modularity and Extensibility

Ryu is structured differently from other solutions in that it provides simple supporting infrastructure that users of the platform must write code to utilise as desired. While this requires development expertise, it also allows complete flexibility of the SDN solution.
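To illustrate that toolbox style, below is a minimal sketch of a Ryu application (the class name and log message are ours for illustration, not from any particular project). It subscribes to OpenFlow 1.3 packet-in events and simply logs them, which is roughly the skeleton on which switching or routing logic would be built:

# Minimal Ryu application sketch: log OpenFlow 1.3 packet-in events.
# Run with: ryu-manager packet_logger.py
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class PacketLogger(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPPacketIn, MAIN_DISPATCHER)
    def packet_in_handler(self, ev):
        # Each packet-in carries the datapath (switch) and the ingress port.
        msg = ev.msg
        in_port = msg.match['in_port']
        self.logger.info("packet in: switch=%s port=%s", msg.datapath.id, in_port)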

Scalability

Ryu does not have an inherent clustering ability and requires external tools to share the network state and allow failover between cluster members.

Cluster Scalability

  • External tools such as Zookeeper distribute a desired state. Extra instances of the controller can be started independently as long as the backing configuration remains identical.

Architectural Scalability

  • While Ryu supports high availability via a Zookeeper component, it does not yet support a co-operative cluster of controllers.

Interfaces

  • Southbound: It supports multiple southbound protocols for managing devices, such as OpenFlow, NETCONF, OF-Config, and partial support of P4
  • Northbound: Offer RESTful APIs only, which are limited compared to ONOS and ODL

Telemetry

Ryu doesn’t provide any telemetry functionality. This needs to be provided via external tools.

Resilience and Fault Tolerance

Ryu has no inbuilt clustering mechanism, instead relying on external tools to maintain availability. High availability is achieved by running multiple, identically configured instances, or a single instance controlled by an external framework that detects and restarts failed nodes.

Fault tolerance can be provided by Zookeeper, which monitors the controllers to detect failures and shards state between cluster members.

Programming Language

Ryu is written in Python.

Community

Ryu has an active community developing the framework, and it is a well-supported, targeted controller.

Conclusion

Ryu is like a toolbox of software components that provides SDN controller functionality. It supports various southbound interfaces for managing network devices. It is very popular in academia and has been used in OpenStack as a network controller.
Next, we will be evaluating Faucet.


The post Comparison of Software Defined Networking (SDN) Controllers. Part 5: Ryu appeared first on Aptira.

by Farzaneh Pakzad at July 29, 2019 01:19 PM

July 26, 2019

Ed Leafe

Why OpenStack Failed, or How I Came to Love the Idea of a BDFL

OK, so the title of this is a bit clickbait-y, but let me explain. By some measures, OpenStack is a tremendous success, being used to power several public clouds and many well-known businesses. But it has failed to become a powerful player in the cloud space, and I believe the reason is not technical in … Continue reading "Why OpenStack Failed, or How I Came to Love the Idea of a BDFL"

by ed at July 26, 2019 04:56 PM

Chris Dent

Placement Update 19-29

Welcome to a rushed pupdate 19-29. My morning was consumed by other things.

A reminder: The Placement project holds office hours every Wednesday at 1500 UTC in the #openstack-placement IRC channel. If you have a topic that needs some synchronous discussion, then is an ideal time. Just start talking!

Most Important

The two main things on the Placement radar are implementing Consumer Types, and the cleanups, performance analysis and documentation related to nested resource providers.

What's Changed

  • The api-ref has moved to docs.openstack.org from developer.openstack.org. Redirects are in place.

  • Both traits and resource classes are now cached per request, allowing for some name to id and id to name optimizations.

  • A new zuul template is being used in placement that means fewer irrelevant tempest tests are run on placement changes.

Stories/Bugs

(Numbers in () are the change since the last pupdate.)

There are 21 (-1) stories in the placement group. 0 (0) are untagged. 2 (0) are bugs. 5 (0) are cleanups. 10 (0) are rfes. 4 (0) are docs.

If you're interested in helping out with placement, those stories are good places to look.

osc-placement

osc-placement is currently behind by 12 microversions.

  • https://review.opendev.org/666542 Add support for multiple member_of. There's been some useful discussion about how to achieve this, and a consensus has emerged on how to get the best results.

Main Themes

Consumer Types

Adding a type to consumers will allow them to be grouped for various purposes, including quota accounting.

Cleanup

Cleanup is an overarching theme related to improving documentation, performance and the maintainability of the code. The changes we are making this cycle are fairly complex to use and are fairly complex to write, so it is good that we're going to have plenty of time to clean and clarify all these things.

Other Placement

Miscellaneous changes can be found in the usual place.

There are two os-traits changes being discussed. And zero os-resource-classes changes (yay!).

Other Service Users

New discoveries are added to the end. Merged stuff is removed. Anything that has had no activity in 4 weeks has been removed.

End

If we get the chance, it will be interesting to start working with placement with 1 million providers. Just to see.

by Chris Dent at July 26, 2019 03:40 PM

OpenStack Superuser

How to run a packaged function with Qinling

In my previous post about Qinling I explained how to run a simple function, how to get the output returned by this one and what Qinling really does behind the scenes from a high-level perspective.

Here I’ll explain how to run a packaged function that includes external Python libraries, whether from PyPI (the Python Package Index), from your own repository, or even directly from the code itself, like a sub-package.

From a serverless perspective, the main difference between a simple function and a packaged function is that with a simple function you are limited to the libraries/packages installed within the runtime.

Most of the time, only the built-in packages (JSON, HTTP, etc.) are available, which lets you do the basics but will constrain your creativity — and we don’t want that!

Qinling can distinguish between the two when you create the function.

A function a bit more complex this time

Compared to the previous post, this function will be a little bit more complex — but not too much, don’t worry. It’s written in Python 3, so just to reiterate, you’ll need a Python 3 runtime.

This function will just return information about a CIDR. By default no argument is required, but I’ll explain how to override the default value when running openstack function execution create.

import json
from IPy import IP


def details(cidr="192.168.0.0/24", **kwargs):
    network = IP(cidr)
    version = network.version()
    iptype = network.iptype().lower()
    reverse = network.reverseName()
    prefix = network.prefixlen()
    netmask = str(network.netmask())
    broadcast = str(network.broadcast())
    length = network.len()

    payload = {"ip_version": version, "type": iptype, "reverse": reverse,
               "prefix": prefix, "netmask": netmask, "broadcast": broadcast,
               "length": length, "cidr": cidr}

    print("----------------------")
    print("Function:", details.__name__)
    print("JSON payload:", payload)
    print("----------------------\n")

    return build_json(payload)


def build_json(data):
    indentation_level = 4

    print("----------------------")
    print("Function:", build_json.__name__)
    print("JSON options")
    print("  - indentation:", indentation_level)
    print("  - sort: yes")
    print(json.dumps(data, sort_keys=True, indent=indentation_level))
    print("----------------------")

    return data

The important part in this code is line 2, the import of the IPy library, which doesn’t exist in the runtime. If the code is uploaded like that, the function execution will fail.

To make this work, the library needs to be at the same level as the ip_range.py file.

$ mkdir ~/qinling
$ wget -O ~/qinling/ip_range.py https://git.io/fj0SQ
$ pip install IPy -t ~/qinling

The ~/qinling directory should look like this after the previous commands:

$ ls ~/qinling
ip_range.py  IPy-1.0.dist-info  IPy.py  __pycache__

Just a quick warning: the pip command used should match the Python version of the runtime; if not, some surprises are to be expected.

The next step is to generate an archive. Qinling has a restriction on the format of the archive: it has to be a ZIP archive generated with the zip command[1].

$ cd ~/qinling/
$ zip -r9 ~/qinling/ip_range.zip ~/qinling/

Run the best function ever ^^

As mentioned above, Qinling has a mechanism to determine whether you’re running a package or not. There are four options available:

  • file: used only with a file, hello_qinling.py
  • package: used only with a ZIP archive, ip_range.zip
  • container/object: will be discussed in a different Medium post
  • image: will be discussed in a different Medium post

So, did you guess which one will be the winner this time? Well… package!

The file option is kind of a “wrapper”: based on the python-qinlingclient code[2], when this option is selected the client gets the filename, removes the extension and creates a ZIP archive.

$ openstack function create --name func-pkg-1 --runtime python3 --entry ip_range.details --package ~/qinling/ip_range.zip

If the wrong option is used, let’s say --file for a package, then the function will not be executed properly and an error will be raised. When the function is properly created, the execution will return output like the following.

$ openstack function execution create 1030e1ea-2374-40a7-bfbe-216bc5966f55
| result           | {"duration": 0.036, "output": "{
    "broadcast": "192.168.0.255",
    "cidr": "192.168.0.0/24",
    "ip_version": 4,
    "length": 256,
    "netmask": "255.255.255.0",
    "prefix": 24,
    "reverse": "0.168.192.in-addr.arpa.",
    "type": "private"
}"} |

In the function, there are a few print statements used mostly for learning purposes; their output is available only via the openstack function execution log show command.

$ openstack function execution log show 5f2e7d71-7b26-4ab7-9e1a-854d8850e738
Start execution: 5f2e7d71-7b26-4ab7-9e1a-854d8850e738
----------------------
Function: details
JSON payload: {'ip_version': 4, 'type': 'private', 'reverse': '0.168.192.in-addr.arpa.', 'prefix': 24, 'netmask': '255.255.255.0', 'broadcast': '192.168.0.255', 'length': 256, 'cidr': '192.168.0.0/24'}
--------------------------------------------
Function: build_json
JSON options
  - indentation: 4
  - sort: yes
{
    "broadcast": "192.168.0.255",
    "cidr": "192.168.0.0/24",
    "ip_version": 4,
    "length": 256,
    "netmask": "255.255.255.0",
    "prefix": 24,
    "reverse": "0.168.192.in-addr.arpa.",
    "type": "private"
}
----------------------
Finished execution: 5f2e7d71-7b26-4ab7-9e1a-854d8850e738

What do you think? Pretty nice, right?

Change the default CIDR value

As mentioned previously, no argument is required to execute the function. By default, the CIDR has been hardcoded to 192.168.0.0/24, but what if you want to change it? You could update the code and create a new function, but there is a simpler way.

The solution is to use the --input option and provide a JSON hash to it.

$ openstack function execution create 1030e1ea-2374-40a7-bfbe-216bc5966f55 --input '{"cidr": "10.0.0.0/10"}'
| result           | {"duration": 0.035, "output": "{
    "broadcast": "10.63.255.255",
    "cidr": "10.0.0.0/10",
    "ip_version": 4,
    "length": 4194304,
    "netmask": "255.192.0.0",
    "prefix": 10,
    "reverse": "0-255.10.in-addr.arpa.",
    "type": "private"
}"} |

Now run the openstack function execution log show command to see the differences between the two CIDRs.

Conclusion

I’ve just demonstrated how to run a packaged function, how to pass arguments to it and how to get the output. My journey continues… To infinity and beyond!

Resources

 

About the author

Gaëtan Trellu is a technical operations manager at Ormuco. This post first appeared on Medium.

 

Superuser is always interested in open infra community topics, get in touch at editorATopenstack.org

 

Photo // CC BY NC

The post How to run a packaged function with Qinling appeared first on Superuser.

by Gaëtan Trellu at July 26, 2019 02:02 PM

Aptira

Comparison of Software Defined Networking (SDN) Controllers. Part 4: OpenKilda

Aptira Comparison of Software Defined Networking (SDN) Controllers. OpenKilda

Our Open Source Software Defined Networking (SDN) Controller comparison continues with OpenKilda. OpenKilda is a Telstra-developed, OpenFlow-based SDN controller currently being used in production to control the large Pacnet infrastructure. It has been shown to be successful in a distributed production environment.

Designed to implement a distributed SDN control plane across a network that spans the globe, OpenKilda solves the problem of latency while providing a scalable SDN control and data plane and end-to-end flow telemetry.

Architecture

The Architecture of OpenKilda is shown in the figure below:

Aptira Comparison of Software Defined Networking (SDN) Controllers. OpenKilda Diagram
  • Structurally, OpenKilda uses the Floodlight software to interact with switches using OpenFlow, but pushes decision-making functionality into other parts of the stack.
  • Kafka is used as a message bus for the telemetry coming from Floodlight and feeds the information into an Apache Storm based cluster of agents for processing.
  • Storm passes the time-series data to OpenTSDB for storage and analysis.
  • Neo4j provides graph analysis and visualisation.

Modularity and Extensibility

OpenKilda is built on several well-supported open-source components to implement a decentralised, distributed control plane, backed by a unique, well-designed cluster of agents that drives network updates as required. The modular nature of the architecture makes it reasonably easy to add new features.

Scalability

OpenKilda is able to scale process-intensive profiling and decision-making functionality horizontally, independently of the control plane.

Cluster Scalability

  • OpenKilda approaches cluster scalability in a modular way. While Floodlight is used as a Southbound interface to the switch infrastructure, responsibility for PCE and telemetry processing is pushed northward into a completely separate Apache Storm based cluster. Each Floodlight instance is idempotent, with no requirement to share state. The Apache Storm cluster is by design horizontally scalable and allows throughput to be increased by adding nodes.

Architectural Scalability

  • BGP is currently not implemented and may need to be developed.

Interfaces

  • Southbound: OpenKilda supports OpenFlow.
  • Northbound: It offers RESTful APIs only, which are limited compared to ONOS and ODL.

Telemetry

Extracting usable telemetry from the infrastructure was a core design principle of OpenKilda, so one output from the Storm agents is a stream of time-series data, collected by a Hadoop-backed OpenTSDB data store. This data can be used operationally in a multitude of ways, from problem management to capacity planning.

Resilience and Fault Tolerance

OpenKilda has no inbuilt clustering mechanism, instead relying on external tools to maintain availability. High availability is achieved by running multiple, identically configured instances, or a single instance controlled by an external framework that detects and restarts failed nodes.

Programming Language

OpenKilda is written in Java.

Community

While the functionality of OpenKilda in its intended space is promising, community support is still being cultivated, leaving much of the development and maintenance burden on its current users, and feature velocity is slow. OpenKilda needs your support – chat with us to get involved.

Conclusion

OpenKilda was introduced by Telstra and is already used in production within Telstra. It has a distributed architecture and leverages other well-supported open-source projects for telemetry processing and for implementing PCE functionality. From a technical point of view, it may not be suitable for geo-redundant environments or segment routing due to the lack of BGP and MPLS tagging.
Next, we will be evaluating Ryu.

SDN Controller Comparisons:

Remove the complexity of networking at scale.
Learn more about our SDN & NFV solutions.

Learn More

The post Comparison of Software Defined Networking (SDN) Controllers. Part 4: OpenKilda appeared first on Aptira.

by Farzaneh Pakzad at July 26, 2019 01:18 PM

July 25, 2019

OpenStack Superuser

Inside open infrastructure: The latest from the OpenStack Foundation

Welcome to the latest edition of the OpenStack Foundation Open Infrastructure newsletter, a digest of the latest developments and activities across open infrastructure projects, events and users. Sign up to receive the newsletter and email community@openstack.org to contribute.

Spotlight on: Airship elections

The Airship team completed its first Technical Committee election.
The five elected members are:

  • James Gu, SUSE
  • Alexander Hughes, Accenture
  • Jan-Erik Mångs, Ericsson
  • Alexey Odinokov, Mirantis
  • Ryan van Wyk, AT&T

The Technical Committee, one of two governing bodies for Airship, is responsible for ensuring that Airship projects adhere to core principles, promoting standardization, and defining and organizing the project’s versioning and release process. The candidates were six leaders from six different companies, a reflection of the growth of the Airship project since it launched in early 2018.

Congrats to the new Technical Committee members and also thanks to everyone who participated. Governance by community elected officials is one of the cornerstones of the Four Opens and a major step forward in the maturation of Airship. The Technical Committee is organizing its first meeting and will soon publish the schedule and agenda to the Airship mailing list.

The Technical Committee is one of two governing bodies within the Airship community. With that election wrapped up, they’ve turned their attention to the Working Committee election. The Working Committee guides the project strategy, helps arbitrate disagreements between core reviewers within a single project or between Airship projects, defines core project principles, assists in marketing and communications, provides product management, and offers ecosystem support.

Nominations for the Airship Working Committee are now open until July 30, 19:00 UTC. Anyone who has contributed to the Airship project within the last 12 months is eligible to run for the Working Committee and vote.

Visit the website for more information about how Airship can manage infrastructure deployments and life cycle. You’ll also learn more about how to get started by using Airship in a Bottle, attending one of the weekly meetings and getting involved with development.

Open Infrastructure Summit Shanghai and Project Teams Gathering (PTG)

OpenStack Foundation news

  • Registration is open. Summit tickets also grant access to the PTG. You can pay in U.S. dollars, or in yuan if you need an official invoice (fapiao).
  • If your organization can’t fund your travel, apply for the Travel Support Program by August 8.
  • If you need a travel visa, get started now: Information here.
  • Put your brand in the spotlight by sponsoring the Summit: Learn more here.
  • Is your team coming to the PTG? Remember to answer the survey by August 11. If you’re a team lead and missed the email with the survey, please contact Kendall Nelson (knelson@openstack.org).

OpenStack Foundation Project News

OpenStack

  • July 25 marks the second milestone in the development of the Train release. Feature development is now being finalized in preparation for the final release, planned for October 16.
  • The 2019 OpenStack User Survey is open until August 22. If you’re running OpenStack, please share your deployment choices and feedback.

StarlingX

  • Check out the new StarlingX main Wiki page for updates on current activities, tools, processes and how to participate in the community.
  • StarlingX and the OSF Edge Computing Group are collaborating to test minimal reference architectures that suit different edge use cases. Community members are deploying StarlingX with a distributed control architecture on hardware donated by Packet.com. See the StarlingX Wiki for more about what the deployment configuration looks like, which locations the components are running in and more.

Upcoming open infrastructure community events

August

  • OpenInfra Day Vietnam
  • OpenStack Upstream Institute

September

  • 24-26 OpenStack Day DOST, Berlin, Germany
  • 24-26 Ansible Fest, Atlanta, Georgia (Zuul booth)
  • 26-27 OpenCompute Regional Summit, Amsterdam, The Netherlands (OSF booth #B23)

October

November

  • 18-21 KubeCon+CloudNativeCon, San Diego, California
  • OSF reception on Monday, November 18 at the Hilton Bayfront Hotel
  • OSF booth

Questions / feedback / contribute

This newsletter is written and edited by the OpenStack Foundation staff to highlight open infrastructure communities. We want to hear from you!
If you have feedback, news or stories that you want to share, reach us through community@openstack.org. To receive the newsletter, sign up here.

The post Inside open infrastructure: The latest from the OpenStack Foundation appeared first on Superuser.

by OpenStack Foundation at July 25, 2019 04:31 PM

Aptira

Comparison of Software Defined Networking (SDN) Controllers. Part 3: OpenDayLight (ODL)

Aptira Comparison of Software Defined Networking (SDN) Controllers. OpenDayLight ODL

Our Open Source Software Defined Networking (SDN) Controller comparison continues with OpenDayLight (ODL). ODL is more focused on the SD-LAN and Cloud integration spaces.

OpenDaylight is a modular open platform for customising and automating networks of any size and scale. The OpenDaylight Project arose out of the SDN movement, with a clear focus on network programmability. It was designed from the outset as a foundation for commercial solutions that address a variety of use cases in existing network environments.

Architecture

ODL consists of three layers:

Aptira Comparison of Software Defined Networking (SDN) Controllers. OpenDayLight ODL Diagram
  • Southbound plugins that communicate with the network devices
  • Core services that are accessed via the Service Abstraction Layer (SAL), which is based on OSGi and allows components to be loaded and unloaded while the controller is running
  • Northbound interfaces (e.g. REST/NETCONF) that allow operators to apply high-level policies to network devices or to integrate ODL with other platforms

Modularity and Extensibility

Built-in mechanisms provided by ODL simplify the connection of code modules. The controller takes advantage of OSGi containers for loading bundles at runtime, allowing a very flexible approach to adding functionality.

Scalability

ODL uses a model-based approach, which implies that a global, in-memory view of the network is required to perform logic calculations. ODL’s latest release further advances the platform’s scalability and robustness, with new capabilities supporting multi-site deployments for geographic reach, application performance and fault tolerance.

Cluster Scalability

  • ODL contains internal functionality for maintaining a cluster: an Akka-based distributed datastore shares the current SDN state and allows controllers to fail over in the event of a cluster partition.
  • As a cluster grows, however, communication and coordination activities rapidly increase, limiting performance gains per additional cluster member.

Architectural Scalability

  • ODL includes native BGP routing capabilities to coordinate traffic flows between SDN islands.
  • Introducing OpenDaylight into OpenStack provides multi-site networking while boosting networking performance.

Interfaces

  • Southbound: ODL supports an extensive list of southbound interfaces including OpenFlow, P4, NETCONF, SNMP, BGP, RESTCONF and PCEP.
  • Northbound: ODL offers the largest set of northbound interfaces, with gRPC and RESTful APIs. The northbound interfaces supported by ODL include OSGi, for applications in the same address space as the controller, and the standard RESTful interface (a hedged query example follows this list). DLUX presents the northbound interfaces visually to ease integration and development work.
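
As a rough illustration only (not taken from the Aptira post), the RESTCONF northbound interface can be queried over HTTP with a few lines of Python; the controller address and the default admin/admin credentials below are assumptions and would need to match a real deployment.

import requests

# Assumed controller address and default credentials, for illustration only.
ODL_URL = "http://odl-controller:8181/restconf/operational/network-topology:network-topology"

response = requests.get(ODL_URL, auth=("admin", "admin"),
                        headers={"Accept": "application/json"})
response.raise_for_status()
print(response.json())   # dumps the operational view of the network topology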

Telemetry

At a project level, ODL has limited telemetry related functionality. With the latest development release, there are moves toward providing northbound telemetry feeds, but they are in early design and not likely to be ready for production in the short term.

Resilience and Fault Tolerance

ODL’s fault tolerance mechanism is similar to that of ONOS, with an odd number of SDN controllers required to provide fault tolerance in the system. In the event of a master node failure, a new leader is selected to take control of the network. The mechanism for choosing a leader differs slightly between the two controllers – while ONOS focuses on eventual consistency, ODL focuses on high availability.

Programming Language

From a language perspective, ODL is written in Java.

Community

ODL is the second of the SDN controllers under the Linux Foundation Networking umbrella. The project has the largest community support of all open source SDN controllers on the market, with several big-name companies actively involved in development.

Conclusion

OpenDayLight is the most pervasive open-source SDN controller, with extensive northbound and southbound APIs. In addition to its resiliency and scalability, the modular architecture of ODL makes it a suitable choice for different use cases. This is why OpenDayLight has been integrated with other open-source SDN/NFV orchestration and management solutions such as OpenStack, Kubernetes, OPNFV and ONAP, which are very popular platforms in telco environments.
Next, we will be evaluating OpenKilda.

SDN Controller Comparisons:

Remove the complexity of networking at scale.
Learn more about our SDN & NFV solutions.

Learn More

The post Comparison of Software Defined Networking (SDN) Controllers. Part 3: OpenDayLight (ODL) appeared first on Aptira.

by Farzaneh Pakzad at July 25, 2019 01:32 PM

July 24, 2019

SUSE Conversations

SUSE OpenStack Cloud 9 – Now Included in SUSE YES Certification

More and more, businesses are seeking cloud solutions that provide an easy to deploy and manage, heterogeneous cloud infrastructure for provisioning development, test and production workloads in a way that is supportable, compliant and secure. In addition, they want a solution that has gone through an official certification program to give them confidence that their […]

The post SUSE OpenStack Cloud 9 – Now Included in SUSE YES Certification appeared first on SUSE Communities.

by Daryl Stokes at July 24, 2019 03:49 PM

Aptira

Comparison of Software Defined Networking (SDN) Controllers. Part 2: Open Network Operation System (ONOS)

Aptira Comparison of Software Defined Networking (SDN) Controllers. Open Network Operation System (ONOS)

We begin our Open Source Software Defined Networking (SDN) Controller comparison with the Open Network Operating System (ONOS). ONOS is designed to be distributed, stable and scalable with a focus on Service Provider networks.

The Open Network Operating System is the only SDN controller platform that supports the transition from legacy “brown field” networks to SDN “green field” networks. This enables exciting new capabilities and disruptive deployment and operational cost points for network operators.

Architecture

ONOS is designed as a three-tier architecture as follows:

Aptira Comparison of Software Defined Networking (SDN) Controllers. Open Network Operation System (ONOS) Diagram
  • Tier 1 comprises modules related to protocols that communicate with the network devices (Southbound in the figure)
  • Tier 2 comprises the core of ONOS and provides network state without relying on any particular protocol
  • Tier 3 comprises applications (ONOS apps) that use the network state information presented by Tier 2

Modularity and Extensibility

ONOS has built-in mechanisms for connecting/disconnecting components while the controller is running. This allows a very flexible approach to adding functionality to the controller.

Scalability

ONOS is designed specifically to scale horizontally for performance and geo-redundancy across small regions.

Cluster Scalability

  • The cluster configuration is simple, with new controllers being able to join and leave dynamically, giving flexibility over time.
  • The Atomix distributed datastore, which prioritises data consistency, should reduce the outages caused by cluster partitioning as all hosts are guaranteed to have the correct data.
  • As a cluster grows however, communication and coordination activities rapidly increase, limiting performance gains per additional cluster member.

Architectural Scalability

  • ONOS includes native BGP routing capabilities to coordinate traffic flows between the SDN islands.
  • There are several documented instances of ONOS (e.g. ICONA, SDN-IP) being used successfully in a geo-redundant architecture for controlling large scale SD-WANs.

Interfaces

  • Southbound: ONOS supports an extensive list of southbound interfaces including OpenFlow, P4, NETCONF, TL1, SNMP, BGP, RESTCONF and PCEP.
  • Northbound: ONOS offers a broad set of northbound interfaces, with gRPC and RESTful APIs.
  • GUI: The ONOS GUI is a single-page web application providing a visual interface to the Open Network Operating System controller (or cluster of controllers).
  • Intent-based framework: ONOS ships with a built-in intent framework. By abstracting a network service into a set of criteria a flow should meet, the generation of the underlying OpenFlow (or P4) configuration is handled internally, with the client system specifying only the desired functional outcome (see the sketch after this list).
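
As a rough sketch only (not part of Aptira’s evaluation), a host-to-host intent can be submitted through the ONOS REST API; the controller address, the default onos/rocks credentials and the host IDs below are placeholders for illustration.

import requests

# Assumed controller address, default credentials and placeholder host IDs (MAC/VLAN).
ONOS_URL = "http://onos-controller:8181/onos/v1/intents"

intent = {
    "type": "HostToHostIntent",
    "appId": "org.onosproject.cli",
    "priority": 100,
    "one": "00:00:00:00:00:01/None",
    "two": "00:00:00:00:00:02/None",
}

response = requests.post(ONOS_URL, json=intent, auth=("onos", "rocks"))
response.raise_for_status()
# ONOS compiles the intent into the flow rules needed to connect the two hosts.
print("Intent submitted:", response.status_code)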

Telemetry

Telemetry feeds are available through pluggable modules that ship with the software, with InfluxDB and Grafana plug-ins included in the latest release.

Resilience and Fault Tolerance

ONOS has a very simple administration mechanism for clusters with native commands for adding and removing members.

The Open Network Operating System provides fault tolerance in the system with an odd number of SDN controllers. In the event of a master node failure, a new leader is selected to take control of the network.

Programming Language

ONOS is written in Java.

Community

The Open Network Operating System is supported under the Linux Foundation Networking umbrella and boasts a large developer and user community.

Conclusion

Given this evaluation, the Open Network Operating System is a suitable choice for Communication Service Providers (CSPs). This is because ONOS supports an extensive list of northbound and southbound APIs, so vendors do not have to write their own protocol to configure their devices. It also supports the YANG model, which enables vendors to write their applications against it. The scalability of ONOS makes it highly available and resilient against failure, which improves the user experience. Finally, the software modularity of ONOS allows users to easily customise, read, test and maintain the code.
Next, we will be evaluating OpenDayLight.

SDN Controller Comparisons:

Remove the complexity of networking at scale.
Learn more about our SDN & NFV solutions.

Learn More

The post Comparison of Software Defined Networking (SDN) Controllers. Part 2: Open Network Operation System (ONOS) appeared first on Aptira.

by Farzaneh Pakzad at July 24, 2019 02:52 AM

About

Planet OpenStack is a collection of thoughts from the developers and other key players of the OpenStack projects. If you are working on OpenStack technology you should add your OpenStack blog.

Last updated:
September 21, 2019 09:37 PM
All times are UTC.

Powered by:
Planet