[ironic]: Timeout reached while waiting for callback for node

Arne Wiebalck arne.wiebalck at cern.ch
Wed Oct 30 20:58:51 UTC 2019


Hi Fred,

To confirm what Julia said:

We currently have ~3700 physical nodes in Ironic, managed by 3 controllers
(16GB VMs running httpd, conductor, and inspector). We recently moved to
larger nodes for these controllers due to the "thundering image" problem Julia
was mentioning: when we deployed ~100 nodes in parallel, the conductors
were running out of memory. We have yet to see whether that change has the
desired effect, though: we will add another 1000 nodes or so over the coming
weeks. As in your case, this is all with the iscsi deploy interface. We didn't
set things up with 'direct' initially as we didn't have a Swift endpoint, but if
this problem persists we will look into it, as 'direct' will clearly scale better.

The recently added parallelism in Ironic's power sync sped up this
sync loop significantly: while the loops were running into each other before,
the conductors can now check each of their 1000+ servers in under 60 seconds.
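
For reference, here is a minimal ironic.conf sketch of the knobs involved;
the option names and defaults are from memory, so please verify them against
the docs for your release:

    [conductor]
    # Number of workers used for the parallelized power state sync
    # (the parallelism mentioned above; added around Stein).
    sync_power_state_workers = 8
    # Seconds between runs of the power sync loop.
    sync_power_state_interval = 60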

Cheers,
 Arne



> On 29 Oct 2019, at 15:29, Julia Kreger <juliaashleykreger at gmail.com> wrote:
> 
> That is great news to hear that you've been able to correlate it.
> We've written some things regarding scaling, but the key really
> depends on your architecture and how you're utilizing the workload.
> Since you mentioned a spine-leaf architecture, physical locality of
> conductors will matter, as well as having as much efficiency as
> possible. I believe CERN is running 4-5 conductors to manage roughly 3000+
> physical machines. Naturally you'll need to scale as appropriate to
> your deployment pattern. If much of your fleet is being redeployed
> often, you may wish to consider having more conductors to match that
> overall load.
> 
> 1) Use the ``direct`` deploy interface. This moves the act of
> unpacking the image files and streaming them to disk to the end node.
> This generally requires an HTTP(S) download endpoint offered by the
> conductor OR via Swift. Ironic-Python-Agent downloads the image,
> unpacks it in memory, and streams it directly to disk. With the
> ``iscsi`` interface, you can end up in situations, depending on image
> composition and the settings being passed to dd, where part of your deploy
> process is writing zeros over the wire in blocks to the remote
> disk. Naturally this needlessly consumes I/O bandwidth.
> 2) Once you're using the ``direct`` deploy_interface, consider using
> caching. While we don't use it in CI, ironic does have the capability
> to pass configuration for caching proxy servers. This is set on a
> per-node basis, so you can point nodes at any proxy/caching servers
> you have deployed on your spine or in your leaves close to the
> physical nodes. Some timers are also present to let ironic re-use
> Swift URLs if you're deploying the same image to multiple servers
> concurrently. Swift tempurl usage does reduce the gain from a caching
> proxy, though, but it is something to consider in your architecture
> and IO pattern.
> https://docs.openstack.org/ironic/latest/admin/drivers/ipa.html#using-proxies-for-image-download
> 3) Consider using ``conductor_groups``. If it would help, you can
> localize conductors to specific pools of machines. This
> may be useful if you have pools with different security requirements,
> or if you have multiple spines and can dedicate some conductors per
> spine. https://docs.openstack.org/ironic/latest/admin/conductor-groups.html
> 4) Turn off periodic driver tasks for drivers you're not using. Power
> sync and sensor data collection are two periodic tasks that consume
> resources when they run, and the periodic tasks of unused drivers still
> consume a worker slot and query the database to see if there is work
> to be done. You may also want to increase the number of permitted
> workers. A rough configuration sketch covering these points follows below.
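> 
> To make the above concrete, here is that rough sketch of the relevant
> settings. The option and field names are from memory (and <node-uuid>,
> the proxy URL, and the "rack-a" group are only placeholders), so
> double-check everything against the docs for your release:
> 
>     # 1) switch a node to the direct deploy interface
>     openstack baremetal node set <node-uuid> --deploy-interface direct
> 
>     # ...and/or make it the default in ironic.conf
>     [DEFAULT]
>     enabled_deploy_interfaces = direct,iscsi
>     default_deploy_interface = direct
> 
>     # 2) per-node caching proxy for IPA image downloads
>     openstack baremetal node set <node-uuid> \
>       --driver-info image_http_proxy=http://proxy.example.com:3128 \
>       --driver-info image_https_proxy=http://proxy.example.com:3128
> 
>     # re-use Swift temp URLs for identical images (ironic.conf)
>     [glance]
>     swift_temp_url_cache_enabled = true
> 
>     # 3) pin nodes and conductors to a conductor group
>     openstack baremetal node set <node-uuid> --conductor-group rack-a
>     [conductor]
>     conductor_group = rack-a
> 
>     # 4) trim periodic work and allow more conductor workers (ironic.conf)
>     [DEFAULT]
>     enabled_hardware_types = ipmi
>     [conductor]
>     send_sensor_data = false
>     workers_pool_size = 300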
> 
> Power sync can be a huge issue on older versions. I believe Stein is
> where we improved the parallelism of the power sync workers in Ironic,
> and Train adds a power state change callback to nova, which will greatly
> reduce the ironic-api and nova-compute processor overhead.
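> 
> If you are on Train, the knob for the nova callback is, I believe, the
> one below (name from memory, so treat it as an assumption to verify;
> ironic also needs valid [nova] auth credentials configured so it can
> reach nova's API):
> 
>     [nova]
>     # Have ironic notify nova of power state changes instead of
>     # relying on nova-compute polling ironic for them.
>     send_power_notifications = true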
> 
> Hope this helps!
> 
> -Julia
> 
> On Mon, Oct 28, 2019 at 3:26 PM fsbiz at yahoo.com <fsbiz at yahoo.com> wrote:
>> 
>> Thanks Julia.
>> In addition to what you mentioned, this particular issue seems to have cropped up when we added 100 more baremetal nodes.
>> 
>> I've also narrowed down the issue (TFTP timeouts) to when 3-4 baremetal nodes are in "deploy" state and downloading the OS via iSCSI. Each iSCSI transfer consumes about 6 Gbps, so with four transfers we exceed the 20 Gbps capacity of the leaf-spine links. We are slowly migrating to iPXE, so that should help.
>> 
>> That being said, is there a document on large-scale ironic design architectures?
>> We are looking into a DC design (primarily for baremetals) for up to 2500 nodes.
>> 
>> thanks,
>> Fred.
>> 
>> 
>> On Wednesday, October 23, 2019, 03:19:41 PM PDT, Julia Kreger <juliaashleykreger at gmail.com> wrote:
>> 
>> 
>> Greetings Fred!
>> 
>> Reply in-line.
>> 
>> On Tue, Oct 22, 2019 at 12:47 PM fsbiz at yahoo.com <fsbiz at yahoo.com> wrote:
>> 
>> [trim]
>> 
>> 
>> 
>> TFTP logs: show the TFTP client timed out (weird). Any pointers here?
>> 
>> 
>> Sadly this is one of those things that comes with using TFTP. Issues like this are why the community tends to recommend chainloading ipxe.efi, as you can then perform the transfer over TCP as opposed to UDP, where something might go wrong mid-transport.
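>> 
>> A rough sketch of what switching to iPXE looks like (interface names
>> assume a Stein-or-newer release and are from memory; older releases
>> used the [pxe]/ipxe_enabled option instead, so please verify against
>> your version):
>> 
>>     # ironic.conf
>>     [DEFAULT]
>>     enabled_boot_interfaces = ipxe,pxe
>> 
>>     # flip a node over to the ipxe boot interface
>>     openstack baremetal node set <node-uuid> --boot-interface ipxe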
>> 
>> 
>> tftpd shows ramdisk_deployed completed.  Then, it reports that the client timed out.
>> 
>> 
>> Grub does tend to be very abrupt and not cleanly wrap up its final actions. I suspect it may just never be sending the ack back and the transfer may in fact be completing. I'm afraid this is one of those things where you really need to see on the console what is going on. My guess would be that your deploy_ramdisk lost a packet in transfer or that it was corrupted in transport. It would be interesting to know if the network card stack is performing checksum validation, but for IPv4 it is optional.
>> 
>> 
>> [trim]
>> 
>> 
>> 
>> This has me stumped. This exact failure seems to be happening 3 to 4 times a week on different nodes.
>> Any pointers appreciated.
>> 
>> thanks,
>> Fred.
>> 
>> 
> 



