VyOS Networks Blog

Building an open source network OS for the people, together.

Writing VyOS operational mode commands just got better!

Daniil Baturin
Posted 21 Aug, 2022

Hello, Community!

Writing scripts for operational mode commands used to require quite a bit of tedious code to handle sub-commands and options, and there were no guidelines for an optimal script structure. Now we can automatically generate both command-line options and GraphQL resolvers thanks to Python 3.5+ introspection for type annotations, so you will never have to write that boring code by hand. Read on for details!

Example

First, let me show you a simple example of a new-style operational mode script:

#!/usr/bin/env python3

import sys
import typing

import vyos.opmode

def show_greeting(raw: bool, recipient: typing.Optional[str]):
    if recipient is not None:
        greeting = f"hello {recipient}"
    else:
        greeting = "hello"

    if raw:
        return {"greeting": greeting}
    else:
        return greeting

if __name__ == '__main__':
    try:
        res = vyos.opmode.run(sys.modules[__name__])
        if res:
            print(res)
    except ValueError as e:
        print(e)
        sys.exit(1)

As you can see, it doesn't have any argument parser specifications or argument handling — it's all replaced by a mysterious vyos.opmode.run function call. Let's try calling the script:

$ ./hello.py 
Subcommand required!
usage: hello.py [-h] {show_greeting} ...

$ ./hello.py show_greeting --help
usage: hello.py show_greeting [-h] [--raw] [--recipient RECIPIENT]

optional arguments:
  -h, --help            show this help message and exit
  --raw
  --recipient RECIPIENT
  
$ ./hello.py show_greeting 
hello

$ ./hello.py show_greeting --raw --recipient world
{
    "greeting": "hello world"
}

As you can see, a function that starts with show_ is automatically made into a sub-command with options based on its arguments. Now let's discuss how it's done and how to rewrite existing scripts in the new style.

Background

If you’ve been following VyOS development for a while, you may still remember our first steps of the Big Rewrite. It’s popularly known as a "rewrite from Perl to Python", but language change is just one part of it. A bigger part is fixing old architecture mistakes and giving the codebase a better structure.

So far, our structure improvement effort has been focused on the configuration mode. Its standardization was an immediate success: config generation functions became testable; we could use vyos-configd to pre-load Python modules and Jinja2 templates to avoid the interpreter startup performance penalty; in the future, it will allow commit dry-run, transactional commits, and live rollbacks.

However, operational mode commands escaped our attempts to give them more structure for a long time.

There are several reasons for that. First, the operational mode codebase was a Wild West: some commands were elaborate Perl scripts, some were small shell scripts, and some were just wrappers for normal Linux commands. In the configuration mode, we had a structure to change; but for operational commands, we had to invent it from scratch.

Second, we didn’t know what we wanted from that structure in the first place. There are clear acceptance criteria for configuration mode structure: it must enable transactional commit support with a separate config verification stage. What should an ideal operational mode structure allow us to do?

As we continued to work on the HTTP API, the answer became clear: it should be easy or, ideally, even effortless to expose operational mode commands in that API. Then we started looking for ways to make it effortless.

A few months ago, I made the first awkward step and started refactoring scripts for show commands to separate data collection and formatting functions (get_raw_data and get_formatted_output). My idea was that, at the very least, we should have a way to get raw data from those scripts for unit testing and be able to return it as JSON. It was a step in the right direction but not a real solution to any problem: it was only applicable to show commands, not even all of them, but only commands that take no arguments.

If you forgot or didn’t know, we have a set of standardized operational mode words:

  • show — for displaying information about the system.
  • clear — for completely non-disruptive operations, such as clearing counters or caches.
  • reset — for disruptive operations with limited impacts, such as reset of individual BGP sessions or IPsec tunnels.
  • restart — for operations that restart entire services.
  • generate — for generating new objects inside the system, most commonly cryptographic material.

There are a few exceptions, such as the install image command, but most commands fall into that limited set of operations. So, technically, modules could provide functions with standardized names that an API daemon could import. The remaining question was what to do with sub-commands and their arguments. For example, in BGP we have show ip bgp neighbors <address> — the API daemon needs to be aware of that.

Then we realized that we are using sufficiently recent Python versions to have type annotation support. If so, could we encode operational mode command arguments using Python types?

It took quite a bit of experimentation, but ultimately the answer is yes. The type annotation introspection API is somewhat awkward to use in Python 3.9, but it allows us to do what we want to do. Python 3.10 made many improvements to that API, so when we upgrade the base system to have 3.10, we can make that code even simpler.

Type annotations? What type of annotations?

If you missed the whole story with type annotations in Python, here's a refresher. Python has had syntax for type hints for a long time already, but to CPython they were more like glorified docstrings, and only external tools like mypy could give them any meaning. Since version 3.5, the standard library also includes the typing module, which provides type abstractions such as Optional and makes annotations practical to inspect at run time.

For example:

$ python3
>>> import typing
>>> def foo(x : int, y: typing.Optional[str]):
...     pass
... 
>>> foo.__annotations__
{'x': <class 'int'>, 'y': typing.Optional[str]}

As you can see, it's not a string "int" that it gives us, but a reference to the int class that represents the type of integer numbers in Python. The str type is hidden deeper away, but we can also access it.
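
To make "hidden deeper away" concrete, here is a minimal sketch of how the inner type of an Optional annotation can be reached with typing.get_origin and typing.get_args (both available since Python 3.8). The unwrap_optional helper is a hypothetical name used only for illustration, not a function from the VyOS codebase.

```python
import typing


def foo(x: int, y: typing.Optional[str]):
    pass


def unwrap_optional(annotation):
    """Return (inner_type, is_optional) for a type annotation."""
    # Optional[T] is stored as Union[T, None], so look for a union
    # with exactly one member that is not NoneType.
    if typing.get_origin(annotation) is typing.Union:
        inner = [a for a in typing.get_args(annotation) if a is not type(None)]
        if len(inner) == 1:
            return inner[0], True
    return annotation, False


hints = typing.get_type_hints(foo)
print(unwrap_optional(hints["x"]))
print(unwrap_optional(hints["y"]))
```

The second call reports str as the inner type and flags the argument as optional, which is the information an option generator needs.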

Now, remember that the argparse module also takes those type references for its option validation? In almost every op mode script, we have quite a few command-line options that we have to define by hand, like in the "show dhcp" script.

But if we have a function like show_leases(pool: typing.Optional[str]), we can gain enough information from its name and type annotations to generate argparse statements to allow calling dhcp.py show_leases [--pool $name]. With a bit more effort, we could generate GraphQL schemas from the same data. And so we did.
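
The idea can be sketched roughly as follows. This is not the actual vyos.opmode code, only an illustration under simple assumptions: bool arguments become flags, Optional arguments become options that may be omitted, and everything else becomes a required option. The build_parser name and the body of show_leases are made up for the example.

```python
import argparse
import inspect
import typing


def show_leases(raw: bool, pool: typing.Optional[str]):
    # Hypothetical stand-in for a real op mode function
    data = {"pool": pool, "leases": []}
    return data if raw else f"leases in pool {pool}: none"


def build_parser(func):
    """Build an argparse parser from a function's type annotations."""
    parser = argparse.ArgumentParser(prog=func.__name__)
    hints = typing.get_type_hints(func)
    for name in inspect.signature(func).parameters:
        hint = hints.get(name, str)  # assumption: unannotated means string
        option = "--" + name
        if hint is bool:
            # Booleans become flags that default to False
            parser.add_argument(option, action="store_true")
        elif (typing.get_origin(hint) is typing.Union
              and type(None) in typing.get_args(hint)):
            # Optional[T] becomes an option of type T that may be omitted
            inner = [a for a in typing.get_args(hint) if a is not type(None)][0]
            parser.add_argument(option, type=inner, default=None)
        else:
            # Anything else is treated as a required option of that type
            parser.add_argument(option, type=hint, required=True)
    return parser


parser = build_parser(show_leases)
args = vars(parser.parse_args(["--raw", "--pool", "LAN"]))
print(show_leases(**args))
```

Because argparse stores each option under the argument's own name, the parsed values can be passed straight back to the function as keyword arguments.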

New operational mode script structure

Script naming and function grouping

First, we often had (and still have) different scripts for different operations, like show_openvpn.py and reset_openvpn.py. That has to change: if we are to generate API endpoints, we need to group endpoints related to the same component.

We had quite a few discussions about the GraphQL API for operational mode, and we concluded that trying to group operations by action (show, reset...) and mirror the CLI in the API is neither practical nor even desirable. APIs and CLIs are used in different ways, and what's good for interactive use from the console can be an annoyance for programmatic use. One pain point is that interactive operational mode commands can have variables in the middle, like in show interfaces ethernet eth0 brief. For interactive use, the "ethernet" keyword is good because it allows the user to see completions for specific interface types. For use from a script, it adds no information to the call but forces the scriptwriter to look at the interface name and insert the appropriate interface type into the command.

So, CLI definitions for op mode will remain hand-written, and we will neither generate them nor generate anything else from them — GraphQL endpoints will be generated only from function names and their type annotations.

For example, for retrieving OpenVPN server sessions, the GraphQL query will be {OpenVPN { ShowServer } }, and the script call will be ${vyos_op_scripts_dir}/openvpn.py show_server.

In other words, when you rewrite old op mode scripts for OpenVPN, you should get rid of show_openvpn.py and reset_openvpn.py and merge them into src/op_mode/openvpn.py.

Function naming and types

All functions that start with operational mode "top-level words" (show, clear, reset, restart, generate) are automatically converted to sub-commands and GraphQL queries. Sub-commands mirror function names exactly, while the GraphQL resolver generator converts the name to PascalCase.
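
A rough sketch of that selection and renaming step might look like this. The helper names (find_op_mode_functions, to_pascal_case) and the throwaway demo module are assumptions for illustration; the real generator may handle edge cases, such as acronyms in names, differently.

```python
import inspect
import types

OP_MODE_WORDS = ("show", "clear", "reset", "restart", "generate")


def to_pascal_case(name):
    """Convert a snake_case name such as show_server to ShowServer."""
    return "".join(part.capitalize() for part in name.split("_"))


def find_op_mode_functions(module):
    """Map sub-command names to functions that start with an op mode word."""
    found = {}
    for name, func in inspect.getmembers(module, inspect.isfunction):
        # The first word of the name decides whether it is a command
        if name.split("_")[0] in OP_MODE_WORDS:
            found[name] = func
    return found


# A throwaway module standing in for a real op mode script
demo = types.ModuleType("openvpn")
exec(
    "def show_server(raw: bool): pass\n"
    "def reset_client(name: str): pass\n"
    "def _read_status_file(): pass\n",
    demo.__dict__,
)

for name in find_op_mode_functions(demo):
    print(name, "->", to_pascal_case(name))
```

The underscore-prefixed helper is skipped because its first word is empty, so it never appears as a sub-command or a query.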

All other functions are ignored, but it's a good idea to prefix them with underscores to exclude them from the public interface.

Function arguments

Every show_* function must have a raw: bool argument. If raw is true, the function must return a dict with collected data. If it's false, it must return a string with a human-readable representation of that data.

Other functions do not have any arguments with special meanings.

Note that if a script has distinct operations, they should be implemented by different functions. For example, the current show_dhcp.py script has --leases and --statistics options. In rewritten versions, they should become show_leases(raw: bool, pool: typing.Optional[str], state: typing.Optional[typing.List[str]]) and show_statistics(raw: bool).
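
As an illustration of that split, here is a minimal sketch of what the two functions could look like. The lease records are hypothetical sample data and the output formatting is made up; a real script would query the DHCP server instead.

```python
import typing

# Hypothetical sample data; a real script would query the DHCP server
_LEASES = [
    {"ip": "192.0.2.10", "pool": "LAN", "state": "active"},
    {"ip": "192.0.2.11", "pool": "LAN", "state": "expired"},
    {"ip": "198.51.100.5", "pool": "GUEST", "state": "active"},
]


def _get_leases(pool, state):
    """Filter leases by pool name and by a list of allowed states."""
    leases = _LEASES
    if pool is not None:
        leases = [lease for lease in leases if lease["pool"] == pool]
    if state is not None:
        leases = [lease for lease in leases if lease["state"] in state]
    return leases


def show_leases(raw: bool, pool: typing.Optional[str],
                state: typing.Optional[typing.List[str]]):
    leases = _get_leases(pool, state)
    if raw:
        # Raw mode: a dict that can be serialized as JSON
        return {"leases": leases}
    # Formatted mode: a human-readable string
    return "\n".join(
        f'{lease["ip"]:<16}{lease["pool"]:<8}{lease["state"]}' for lease in leases
    )


def show_statistics(raw: bool):
    pools = {}
    for lease in _LEASES:
        pools[lease["pool"]] = pools.get(lease["pool"], 0) + 1
    if raw:
        return {"pools": pools}
    return "\n".join(f"{name}: {count} leases" for name, count in pools.items())


print(show_leases(False, "LAN", None))
print(show_statistics(True))
```

Keeping data collection in an underscore-prefixed helper means both output modes share the same logic and the helper stays out of the public interface.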

Error handling

Error handling in old operational mode scripts was also completely disorganized — many wouldn't even exit with non-zero status on errors. For code exposed in an API that's unacceptable — such code needs a uniform error handling approach and informative error signaling.

We drafted an exception hierarchy for new-style scripts that we should all follow:

  • On incorrect arguments: raise ValueError
  • If the service that the command deals with is not enabled in the system: raise vyos.opmode.UnconfiguredComponent
  • If data for a show command is not present in the system, but it's likely a temporary condition (e.g., during a service restart): raise vyos.opmode.DataUnavailable
  • If a shared resource is locked (e.g., there's a commit in progress): raise vyos.opmode.ResourceLocked

What's next?

These new helpers and protocols aren't set in stone, and I'm sure there are still missing use cases that will need improvement (and hopefully will not need incompatible changes!). If you have any ideas, please share them!

There is also a plan to introduce two more top-level op mode words: "produce" and "execute" to make that UI more predictable and unified. If you want to join that discussion, see T4624.

Join the rewrite effort!

The foundation for the new operational mode is there, but many old-style scripts remain. To make them usable from the API and to make them easier to maintain and extend, we need to refactor them in the new style. Please join that effort, and we can make it happen much faster (and our rewards for contributors aren't going anywhere)!
