Ansible Extension


An experimental extension to Ansible is included that implements host connections over Mitogen, replacing embedded shell invocations with pure-Python equivalents invoked via highly efficient remote procedure calls tunnelled over SSH. No changes are required to the target hosts.

The extension is not yet generally dependable, but it already works well enough for testing against real-world playbooks. Bug reports in this area are very welcome: Ansible is a huge beast, and only significant testing will prove the extension’s soundness.

Divergence from Ansible’s normal behaviour is considered a bug, so please report anything you notice, regardless of how inconsequential it may seem.

Overview

You should expect a speedup of 1.25x to 7x and a CPU usage reduction of at least 2x, depending on network conditions, the specific modules executed, and the time the target host already spends doing useful work. Mitogen cannot speed up a module once it is executing; it can only ensure the module executes as quickly as possible.

  • A single SSH connection is used for each target host, in addition to one sudo invocation per distinct user account. Subsequent playbook steps always reuse the same connection. This is much better than SSH multiplexing combined with pipelining, as significant state can be maintained in RAM between steps, and the system logs aren’t filled with spam from repeat SSH and sudo invocations.
  • A single Python interpreter is used per host and sudo account combination for the duration of the run, avoiding the repeat cost of invoking multiple interpreters and recompiling imports, saving 300-800 ms for every playbook step.
  • Remote interpreters reuse Mitogen’s module import mechanism, caching uploaded dependencies between steps at the host and user account level. As a consequence, bandwidth usage is consistently an order of magnitude lower compared to SSH pipelining, and around 5x fewer frames are required to traverse the wire for a run to complete successfully.
  • No writes to the target host’s filesystem occur, unless explicitly triggered by a playbook step. In all typical configurations, Ansible repeatedly rewrites and extracts ZIP files to multiple temporary directories on the target host. Since no temporary files are used, security issues relating to those files in cross-account scenarios are entirely avoided.

Demo

This demonstrates Ansible running a subset of the Mitogen integration tests alongside an equivalent run using the extension.

Testimonials

  • “With mitogen my playbook runtime went from 45 minutes to just under 3 minutes. Awesome work!”
  • “The runtime was reduced from 1.5 hours on 4 servers to just under 3 minutes. Thanks!”
  • “Oh, performance improvement using Mitogen is huge. As mentioned before, running with Mitogen enables takes 7m36 (give or take a few seconds). Without Mitogen, the same run takes 19m49! I’m not even deploying without Mitogen anymore :)”
  • “Works like a charm, thank you for your quick response”
  • “I tried it out. He is not kidding about the speed increase.”
  • “I don’t know what kind of dark magic @dmw_83 has done, but his Mitogen strategy took Clojars’ Ansible runs from 14 minutes to 2 minutes. I still can’t quite believe it.”

Installation

Caution

Thoroughly review the list of limitations before use, and do not test the prototype in a live environment until this notice is removed.

  1. Verify Ansible 2.4 and Python 2.7 are listed in the output of ansible --version

  2. Download and extract https://github.com/dw/mitogen/archive/master.zip

  3. Modify ansible.cfg:

    [defaults]
    strategy_plugins = /path/to/mitogen-master/ansible_mitogen/plugins/strategy
    strategy = mitogen_linear
    

    The strategy key is optional. If omitted, you can set the ANSIBLE_STRATEGY=mitogen_linear environment variable on a per-run basis. Just as mitogen_linear mimics the built-in linear strategy, a mitogen_free strategy exists to mimic the built-in free strategy.

  4. Cross your fingers and try it.
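The per-run selection mentioned in step 3 could be wrapped in a small launcher. The following is a minimal sketch, not part of the extension: the helper names mitogen_env and playbook_command are hypothetical, and it assumes strategy_plugins is still configured in ansible.cfg as shown above.

```python
import os


def mitogen_env(base_env=None, strategy="mitogen_linear"):
    """Return a copy of the environment with ANSIBLE_STRATEGY set.

    Hypothetical helper: it only prepares the environment and never
    modifies os.environ itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ANSIBLE_STRATEGY"] = strategy
    return env


def playbook_command(playbook):
    """Build the ansible-playbook argument list for one playbook file."""
    return ["ansible-playbook", playbook]


# Usage (not executed here):
#   import subprocess
#   subprocess.call(playbook_command("site.yml"), env=mitogen_env())
```

Passing strategy="mitogen_free" selects the other strategy named above for a single run, leaving ansible.cfg untouched.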

Limitations

This is a proof of concept: issues below are exclusively due to code immaturity.

High Risk

  • Transfer of large files using certain Ansible-internal APIs, such as those triggered by the copy module, will cause corresponding memory and CPU spikes on both the controller and the target machine, because the file is delivered as a single message. If many machines are targeted, the controller could easily exhaust available RAM. This will be fixed soon as it’s likely to be tickled by common playbooks.

Low Risk

  • Only Ansible 2.4 is being used for development, with occasional tests under 2.5, 2.3 and 2.2. It should be more than possible to fully support at least 2.3, if not also 2.2.
  • Only the sudo become method is available. Adding new methods is straightforward, however, and eventually at least su will be included.
  • The extension’s performance benefits do not scale perfectly linearly with the number of targets. This is a subject of ongoing investigation and improvements will appear in time.
  • “Module Replacer” style modules are not yet supported. These rarely appear in practice, and light Github code searches failed to reveal many examples of them.

Behavioural Differences

  • Ansible permits up to forks SSH connections to be set up simultaneously, whereas in Mitogen this is handled by a thread pool. Eventually this pool will become per-CPU, but meanwhile, a maximum of 16 SSH connections may be established simultaneously by default. This can be increased or decreased by setting the MITOGEN_POOL_SIZE environment variable.

  • Mitogen treats connection timeouts for the SSH and become steps of a task invocation separately, meaning that in some circumstances the configured timeout may appear to be doubled. This is because Mitogen internally treats the creation of an SSH account context separately from the creation of a sudo account context proxied via that SSH account.

    A future revision may detect a sudo account context created immediately following its parent SSH account, and try to emulate Ansible’s existing timeout semantics.

  • Local commands are executed in a reusable Python interpreter created identically to interpreters used on remote hosts. At present only one such interpreter per become_user exists, and so only one local action may be executed at a time per local user account.

    Ansible usually permits up to ansible.cfg:forks simultaneous local actions. Any long-running local actions that execute for every target will experience artificial serialization, causing slowdown equivalent to task_duration * num_targets. This will be fixed soon.

  • Asynchronous jobs exist only for the duration of a run, and cannot be queried by subsequent ansible-playbook invocations. Since the ability to query job IDs across runs relied on an implementation detail, it is not expected this will break any real-world playbooks.
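The connection pool limit described in the first item above can be pictured with a bounded thread pool. This is a minimal sketch under stated assumptions, not Mitogen’s implementation: pool_size and connect_all are hypothetical names, and connect stands in for whatever establishes a connection to one host. Only the MITOGEN_POOL_SIZE variable and the default of 16 come from the text.

```python
import os
from concurrent.futures import ThreadPoolExecutor

DEFAULT_POOL_SIZE = 16  # default stated in the text above


def pool_size(environ=None):
    """Read the pool size from MITOGEN_POOL_SIZE, falling back to 16."""
    environ = os.environ if environ is None else environ
    return int(environ.get("MITOGEN_POOL_SIZE", DEFAULT_POOL_SIZE))


def connect_all(hosts, connect, environ=None):
    """Call connect(host) for every host, at most pool_size() at a time.

    Results are returned in the same order as the hosts argument.
    """
    with ThreadPoolExecutor(max_workers=pool_size(environ)) as pool:
        return list(pool.map(connect, hosts))
```

With the default, a 100-host run would have at most 16 connection attempts in progress at any moment; the remainder wait for a free worker thread.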

How Modules Execute

Ansible usually modifies, recompresses and reuploads modules every time they run on a target, work that must be repeated by the controller for every playbook step.

With the extension any modifications are done on the target, allowing pristine copies of modules to be cached, reducing the necessity to re-transfer modules for each invocation. Unmodified modules are uploaded once on first use and cached in RAM for the remainder of the run.

Binary
Native executables detected using a complex heuristic. Arguments are supplied as a JSON file whose path is the sole script parameter.
Module Replacer
Python scripts detected by the presence of #<<INCLUDE_ANSIBLE_MODULE_COMMON>> appearing in their source. This type is not yet supported.
New-Style
Python scripts detected by the presence of from ansible.module_utils. appearing in their source. Arguments are supplied as JSON written to sys.stdin of the target interpreter.
JSON_ARGS
Detected by the presence of INCLUDE_ANSIBLE_MODULE_JSON_ARGS appearing in the script source. The interpreter directive (#!interpreter) is adjusted to match the corresponding value of {{ansible_*_interpreter}} if one is set. Arguments are supplied as JSON mixed into the script as a replacement for INCLUDE_ANSIBLE_MODULE_JSON_ARGS.
WANT_JSON
Detected by the presence of WANT_JSON appearing in the script source. The interpreter directive is adjusted as above. Arguments are supplied as a JSON file whose path is the sole script parameter.
Old Style
Files not matching any of the above tests. The interpreter directive is adjusted as above. Arguments are supplied as a file whose path is the sole script parameter. The format of the file is "key=repr(value)[ key2=repr(value2)[ ..]] ".
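The detection rules above can be sketched as a chain of substring tests. This is an illustrative reading, not the extension’s code: the function names are hypothetical, the order of the checks is an assumption, and a NUL-byte test stands in for the “complex heuristic” used to detect binaries. The old_style_args helper shows one interpretation of the key=repr(value) argument format.

```python
def classify_module(source):
    """Guess a module's type from its raw bytes, following the list above."""
    # Assumption: a NUL byte near the start stands in for the real,
    # more complex binary-detection heuristic.
    if b"\0" in source[:1024]:
        return "binary"
    if b"#<<INCLUDE_ANSIBLE_MODULE_COMMON>>" in source:
        return "module_replacer"
    if b"from ansible.module_utils." in source:
        return "new_style"
    if b"INCLUDE_ANSIBLE_MODULE_JSON_ARGS" in source:
        return "json_args"
    if b"WANT_JSON" in source:
        return "want_json"
    return "old_style"


def old_style_args(args):
    """Render arguments as 'key=repr(value) ' pairs, trailing space included."""
    return "".join("%s=%r " % (key, value) for key, value in args.items())
```

For instance, a script containing from ansible.module_utils.basic import AnsibleModule is classified as new-style, while a plain shell script with none of the markers falls through to old-style.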

Sample Profiles

Local VM connection

This demonstrates Mitogen vs. connection pipelining to a local VM, executing the 100 simple repeated steps of run_hostname_100_times.yml from the examples directory. Mitogen requires 43x less bandwidth and 4.25x less time.

_images/run_hostname_100_times.png

Kathmandu to Paris

This is a full Django application playbook over a ~180ms link between Kathmandu and Paris. Aside from large pauses where the host performs useful work, the high latency of this link means Mitogen only manages a 1.7x speedup.

Many early roundtrips are due to inefficiencies in Mitogen’s importer that will be fixed over time, however the majority, comprising at least 10 seconds, are due to idling while the host’s previous result and next command are in-flight on the network.

The initial extension lays groundwork for exciting structural changes to the execution model: a future version will tackle latency head-on by delegating some control flow to the target host, melding the performance and scalability benefits of pull-based operation with the management simplicity of push-based operation.

_images/costapp.png

SSH Variables

Matching Ansible’s existing model, these variables are treated on a per-task basis, causing establishment of additional reusable interpreters as necessary to match the configuration of each task.

This list will grow as more missing pieces are discovered.

  • ansible_ssh_timeout
  • ansible_host, ansible_ssh_host
  • ansible_user, ansible_ssh_user
  • ansible_port, ssh_port
  • ansible_ssh_executable, ssh_executable
  • ansible_ssh_private_key_file
  • ansible_ssh_pass, ansible_password (default: assume passwordless)
  • ssh_args, ssh_common_args, ssh_extra_args

Sudo Variables

  • ansible_python_interpreter
  • ansible_sudo_exe, ansible_become_exe
  • ansible_sudo_user, ansible_become_user (default: root)
  • ansible_sudo_pass, ansible_become_pass (default: assume passwordless)
  • sudo_flags, become_flags
  • ansible.cfg: timeout

Docker Variables

Note: Docker support is only intended for developer testing; it might disappear entirely prior to a stable release.

  • ansible_host

Chat on IRC

Some users and developers hang out on the #mitogen channel on the FreeNode IRC network.

Debugging

Normally with Ansible, diagnostics and logging package output produced on the target machine are discarded. With Mitogen, all of this is captured and returned to the controller, where it can be viewed with -vvv, which produces basic high-level logs. Logging of all IO on the controller is enabled with -vvvv or higher.

Although use of standard IO and the logging package on the target is forwarded to the controller, it is not possible to receive IO activity logs, as the process of receiving those logs would itself generate IO activity. To receive a complete trace of every process on every machine, file-based logging is necessary. File-based logging can be enabled by setting MITOGEN_ROUTER_DEBUG=1 in your environment.

When file-based logging is enabled, one file per context will be created on the local machine and every target machine, as /tmp/mitogen.<pid>.log.
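Once a run with file-based logging has finished, the per-context files on a given machine can be collected with a short helper. This is a minimal sketch: mitogen_log_files is a hypothetical name, and only the /tmp/mitogen.<pid>.log naming pattern comes from the text above.

```python
import glob
import os


def mitogen_log_files(directory="/tmp"):
    """Return the sorted paths matching mitogen.<pid>.log in a directory."""
    return sorted(glob.glob(os.path.join(directory, "mitogen.*.log")))
```

The helper only lists files on the machine where it runs, so it would need to be run on the controller and on each target separately.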

Implementation Notes

Interpreter Reuse

The extension aggressively reuses the single target Python interpreter to execute every module. While this generally works well, it violates an unwritten assumption regarding Ansible modules, and so it is possible that a buggy module could cause a run to fail, or that unrelated modules could interact with each other due to bad hygiene.

Before reporting a bug relating to a module behaving incorrectly, please re-run your playbook with -e mitogen_task_isolation=fork to see if the problem abates. This may also be set on a per-task basis:

- name: My task.
  broken_module:
    some_option: true
  vars:
    mitogen_task_isolation: fork

If forking fixes your problem, please report a bug regardless, as an internal list can be updated to prevent users bumping into the same problem in future.

Interpreter Recycling

The extension limits the number of persistent interpreters in use. When the limit is reached, the youngest interpreter is terminated before starting a new interpreter, preventing situations like the one below from triggering memory exhaustion.

- hosts: corp_boxes
  vars:
    user_directory: [
      # 10,000 corporate user accounts
    ]
  tasks:
    - name: Create user bashrc
      become: true
      vars:
        ansible_become_user: "{{item}}"
      copy:
        src: bashrc
        dest: "~{{item}}/.bashrc"
      with_items: "{{user_directory}}"

This recycling does not occur for direct connections from the controller, and it is keyed on a per-target basis, i.e. up to 20 interpreters may exist for each directly connected target.

The youngest interpreter is chosen to preserve useful accounts, like “root” or “postgresql” that tend to appear early in a run, however it is simple to construct a playbook that defeats this strategy. A future version will key interpreters on the identity of their creating task, file and/or playbook, avoiding useful account recycling in every scenario.

To raise or lower the limit from 20, set the MITOGEN_MAX_INTERPRETERS environment variable to a new value.
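The recycling policy described in this section can be modelled in a few lines. This is a minimal sketch, not the extension’s code: InterpreterPool is a hypothetical class that tracks only account names, and the default limit of 20 comes from the text above.

```python
class InterpreterPool:
    """Track persistent interpreters, recycling the youngest at the limit."""

    def __init__(self, limit=20):
        self.limit = limit
        self.names = []  # creation order, oldest first

    def get(self, name):
        """Ensure an interpreter exists for name.

        Returns the name of the interpreter terminated to make room,
        or None if nothing was terminated.
        """
        if name in self.names:
            return None  # already running: reuse it
        evicted = None
        if len(self.names) >= self.limit:
            evicted = self.names.pop()  # youngest = most recently created
        self.names.append(name)
        return evicted
```

With a limit of 3 and requests for root, postgresql, alice and then bob, the request for bob terminates alice, so the early accounts root and postgresql survive, which is the behaviour the text describes.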

Runtime Patches

Three small runtime patches are employed in strategy.py to hook into desirable locations, in order to override uses of shell, the module executor, and the mechanism for selecting a connection plug-in. While it is hoped the patches can be avoided in future, for interesting versions of Ansible deployed today this simply is not possible, and so they continue to be required.

The patches are concise and behave conservatively, including by disabling themselves when non-Mitogen connections are in use. Additional third party plug-ins are unlikely to attempt similar patches, so the risk to an established configuration should be minimal.

Standard IO

Ansible uses pseudo TTYs for most invocations, to allow it to handle typing passwords interactively; however, it disables pseudo TTYs for certain commands where standard input is required or sudo is not in use. Additionally, when SSH multiplexing is enabled, a string like Shared connection to localhost closed\r\n appears in stderr of every invocation.

Mitogen does not naturally require either of these, as command output is embedded within the SSH stream, and it can simply call pty.openpty() in every location an interactive password must be typed.

A major downside to Ansible’s behaviour is that stdout and stderr are merged together into a single stdout variable, with carriage returns inserted in the output by the TTY layer. However ugly, the extension emulates all of this behaviour precisely, to avoid breaking playbooks that expect certain text to appear in certain variables with certain linefeed characters.

See Ansible#14377 for related discussion.
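The carriage returns mentioned above come from the TTY layer’s output processing, which can be observed directly with pty.openpty(). The following is a minimal Unix-only sketch, not the extension’s emulation code: through_tty is a hypothetical helper, and it explicitly enables the OPOST and ONLCR terminal flags rather than relying on platform defaults.

```python
import os
import pty
import select
import termios


def through_tty(data):
    """Write a small byte string to a pseudo TTY and return what comes out.

    The data must be small enough to fit in the TTY buffer, since it is
    written in full before anything is read back.
    """
    master_fd, slave_fd = pty.openpty()
    try:
        # Enable output post-processing with newline -> CR+LF translation.
        attrs = termios.tcgetattr(slave_fd)
        attrs[1] |= termios.OPOST | termios.ONLCR  # index 1 is oflag
        termios.tcsetattr(slave_fd, termios.TCSANOW, attrs)

        os.write(slave_fd, data)

        # Drain everything that becomes readable within a short timeout.
        chunks = []
        while select.select([master_fd], [], [], 0.2)[0]:
            chunks.append(os.read(master_fd, 1024))
        return b"".join(chunks)
    finally:
        os.close(slave_fd)
        os.close(master_fd)
```

A playbook that matches on "\r\n" in a registered stdout variable depends on exactly this translation, which is why the extension reproduces it.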

Flag Emulation

Mitogen re-parses sudo_flags, become_flags, and ssh_flags using option parsers extracted from sudo(1) and ssh(1) in order to emulate their equivalent semantics. This allows:

  • robust support for common ansible.cfg tricks without reconfiguration, such as forwarding SSH agents across sudo invocations,
  • reporting on conflicting flag combinations,
  • reporting on unsupported flag combinations,
  • internally special-casing certain behaviour (like recursive agent forwarding) without boring the user with the details,
  • avoiding opening the extension up to untestable scenarios where users can insert arbitrary garbage between Mitogen and the components it integrates with,
  • precise emulation by an alternative implementation, for example if Mitogen grew support for Paramiko.
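The idea of re-parsing a flag string, rather than passing it through verbatim, can be sketched with the standard library. This is an illustration only: the source states that the real parsers are extracted from sudo(1) and ssh(1), whereas this hypothetical parse_ssh_args uses argparse and recognises just three ssh(1) flags (-o, -C and -A).

```python
import argparse
import shlex


def parse_ssh_args(arg_string):
    """Parse an ssh_args-style string into (settings, unsupported_flags)."""
    parser = argparse.ArgumentParser(prog="ssh", add_help=False)
    parser.add_argument("-o", dest="options", action="append", default=[])
    parser.add_argument("-C", dest="compression", action="store_true")
    parser.add_argument("-A", dest="forward_agent", action="store_true")

    # Unrecognised flags are collected instead of being passed along.
    known, unsupported = parser.parse_known_args(shlex.split(arg_string))

    options = {}
    for item in known.options:
        key, _, value = item.partition("=")
        options[key] = value

    settings = {
        "options": options,
        "compression": known.compression,
        "forward_agent": known.forward_agent,
    }
    return settings, unsupported
```

Because the flags end up as structured settings, a caller can report the unsupported list to the user, or special-case a setting such as agent forwarding, instead of handing arbitrary text to the underlying command.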