everytime after you useOmost , your image is almost there

Omost provides LLMs models that will write codes to compose image visual contents with Omost's virtual Canvas agent. This Canvas can be rendered by specific implementations of image generators to actually generate images.

Get started

you can use the below deployment (requires 8GB Nvidia VRAM):
git clone https://github.com/lllyasviel/Omost.git
cd Omost
conda create -n omost python=3.10
conda activate omost
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
python gradio_app.py
(Note that quant LLM requires `bitsandbytes` - some 9XX or 10XX or 20XX GPUs may have trouble in running it. If that happens, just use our official huggingface space.)

Screenshots

All with random seed 12345

a ragged man wearing a tattered jacket in the nineteenth century:

screenshot 1
screenshot 2

the greatest escape after the apocalypse, the hope of all humanity:

screenshot 3
screenshot 4

jurassic dinosaur battle:

screenshot 5
screenshot 6

the supreme ruler with endless magic on a gothic throne, with undead armies kneeling:

screenshot 7
screenshot 8

a funny cartoon batman fights joker:

screenshot 9
screenshot 10

Example Transcripts

Below is a raw transcript from default model, with random seed 123456

User

generate an image of the fierce battle of warriors and the dragon

Assistant


# Initialize the canvas
canvas = Canvas()

# Set a global description for the canvas
canvas.set_global_description(
    description='A fierce battle between warriors and a dinosaur.',
    detailed_descriptions=[
        'In this intense scene, a group of fierce warriors is engaged in an epic battle with a mighty dinosaur.',
        'The warriors, clad in armor and wielding swords and shields, are positioned on the left side of the image.',
        'Their expressions are determined and focused, reflecting their resolve to defeat the dinosaur.',
        'The dinosaur, with its massive body and powerful legs, dominates the center of the image.',
        'Its scales glisten with a metallic sheen, and its eyes burn with a fierce intensity.',
        'The background is a dark, stormy sky with flashes of lightning, adding to the drama and tension of the battle.',
        'The ground is littered with debris and the remnants of previous battles, enhancing the sense of a long and brutal conflict.',
        'The overall atmosphere is one of chaos and intensity, with the warriors and the dinosaur locked in a fierce struggle for survival.',
    ],
    tags='battle, warriors, dinosaur, fierce, armor, swords, shields, determined, focused, epic, intense, metallic, glistening, stormy sky, lightning, debris, conflict, struggle, survival, chaos, tension, drama, powerful legs, scales, eyes, burning, illuminated',
    HTML_web_color_name='darkslategray',
)

# Add a group of fierce warriors.
canvas.add_local_description(
    location='on the left',
    offset='no offset',
    area='a large horizontal area',
    distance_to_viewer=5.0,
    description='A group of fierce warriors.',
    detailed_descriptions=[
        'The warriors, clad in gleaming armor, are positioned on the left side of the image.',
        'They are armed with swords, shields, and spears, ready for battle.',
        'Their faces are set with determination and focus, reflecting their resolve to defeat the dinosaur.',
        'Some warriors are in mid-action, swinging their swords or shields, while others stand ready to strike.',
        'The armor they wear is intricately designed, with symbols and patterns that signify their rank and bravery.',
        'The ground beneath them is uneven and littered with debris, adding to the sense of a chaotic and intense battle.',
    ],
    tags='warriors, armor, swords, shields, spears, determined, focused, mid-action, intricate design, symbols, patterns, rank, bravery, uneven ground, debris, chaotic, intense, battle',
    atmosphere='Determined and focused, ready for the fierce battle.',
    style='Highly detailed and dynamic, capturing the intensity of the warriors.',
    quality_meta='High resolution with intricate details and dynamic poses.',
    HTML_web_color_name='darkgoldenrod',
)

# Add a mighty dinosaur.
canvas.add_local_description(
    location='in the center',
    offset='no offset',
    area='a large square area',
    distance_to_viewer=7.0,
    description='A mighty dinosaur.',
    detailed_descriptions=[
        'The dinosaur is a massive creature, dominating the center of the image with its powerful legs and fierce roar.',
        'Its scales glisten with a metallic sheen, reflecting the light from the stormy sky.',
        "The dinosaur's eyes burn with a fierce intensity, and its teeth are sharp and menacing.",
        'The powerful legs of the dinosaur are muscular and add to its formidable appearance.',
        'The ground shakes beneath its feet, enhancing the sense of a long and brutal conflict.',
        'The overall appearance is one of destruction and turmoil, with the dinosaur adding to the chaotic atmosphere.',
    ],
    tags='dinosaur, massive, powerful legs, fierce roar, glistening scales, metallic sheen, fierce eyes, sharp teeth, muscular legs, formidable, ground shaking, destruction, turmoil, chaos, battle',
    atmosphere='Intense and menacing, with a powerful presence.',
    style='Epic and dramatic, emphasizing the grandeur and danger of the dinosaur.',
    quality_meta='High resolution with dramatic lighting and detailed textures.',
    HTML_web_color_name='firebrick',
)

# Add a stormy sky with flashes of lightning.
canvas.add_local_description(
    location='on the top',
    offset='no offset',
    area='a large horizontal area',
    distance_to_viewer=10.0,
    description='A stormy sky with flashes of lightning.',
    detailed_descriptions=[
        'The background of the image is a dark, stormy sky filled with swirling clouds and flashes of lightning.',
        'The sky is turbulent, with clouds dark and foreboding, adding to the dramatic tension of the battle.',
        'The lightning flashes illuminate the scene, casting sharp, brief lights over the warriors and the dinosaur.',
        'The stormy sky creates a sense of chaos and unpredictability, heightening the intensity of the battle below.',
        'The overall atmosphere is one of impending doom and relentless conflict, with the storm mirroring the fierce struggle between the warriors and the dinosaur.',
    ],
    tags='stormy sky, dark clouds, lightning, turbulent, foreboding, dramatic tension, illumination, chaos, unpredictability, intensity, impending doom, relentless conflict, battle, warriors, dinosaur, swirling clouds, sharp lights, brief lights',
    atmosphere='Chaotic and intense, mirroring the fierce battle below.',
    style='Dramatic and turbulent, emphasizing the conflict and tension.',
    quality_meta='High resolution with dynamic lighting and detailed cloud textures.',
    HTML_web_color_name='midnightblue',
)

# Add a debris-covered battlefield.
canvas.add_local_description(
    location='on the bottom',
    offset='no offset',
    area='a large horizontal area',
    distance_to_viewer=5.0,
    description='A debris-covered battlefield.',
    detailed_descriptions=[
        'The ground of the battlefield is littered with debris, remnants of previous battles.',
        'Broken weapons, shattered shields, and scattered armor pieces are strewn across the battlefield.',
        'The terrain is uneven, with patches of mud and dirt, adding to the sense of a long and brutal conflict.',
        'The debris-covered battlefield enhances the chaotic and intense atmosphere of the scene, reflecting the ferocity and duration of the battle.',
        'The overall appearance is one of destruction and turmoil, with the remnants of previous battles serving as a grim reminder of the ongoing struggle.',
    ],
    tags='battlefield, debris, broken weapons, shattered shields, scattered armor, uneven terrain, mud, dirt, brutal conflict, chaos, intensity, destruction, turmoil, previous battles, ongoing struggle, remnants, ferocity, duration',
    atmosphere='Chaotic and intense, reflecting the ferocity of the battle.',
    style='Detailed and realistic, emphasizing the destruction and chaos.',
    quality_meta='High resolution with detailed textures and realistic debris.',
    HTML_web_color_name='darkolivegreen',
)

User can click the render button to actually render the image from any of those `Canvas` instances.

Symbols

All Omost LLMs are trained to obey the following symbols

class Canvas:
    def set_global_description(
            self, 
            description: str, 
            detailed_descriptions: list[str], 
            tags: str, 
            HTML_web_color_name: str
    ):
        pass

    def add_local_description(
            self, 
            location: str, 
            offset: str, 
            area: str, 
            distance_to_viewer: float, 
            description: str, 
            detailed_descriptions: list[str], 
            tags: str, 
            atmosphere: str, 
            style: str, 
            quality_meta: str, 
            HTML_web_color_name: str
    ):
        assert location in [
            "in the center", 
            "on the left", 
            "on the right", 
            "on the top", 
            "on the bottom", 
            "on the top-left", 
            "on the top-right", 
            "on the bottom-left", 
            "on the bottom-right"
        ]
        assert offset in [
            "no offset", 
            "slightly to the left", 
            "slightly to the right", 
            "slightly to the upper", 
            "slightly to the lower", 
            "slightly to the upper-left", 
            "slightly to the upper-right", 
            "slightly to the lower-left", 
            "slightly to the lower-right"
        ]
        assert area in [
            "a small square area", 
            "a small vertical area", 
            "a small horizontal area", 
            "a medium-sized square area", 
            "a medium-sized vertical area", 
            "a medium-sized horizontal area", 
            "a large square area", 
            "a large vertical area", 
            "a large horizontal area"
        ]
        assert distance_to_viewer > 0
        pass

During training, the above symbols are associated with specific concepts and use cases related to image generation.

The design is to make those codes easy to learn for LLMs, but also easy to handle for diffusion models.

Lets breakdown each part:

Function: Canvas.set_global_description and Canvas.add_local_description

They set descriptions to images. The meanings of the parameters are same for them, with `add_local_description` having more fields than `set_global_description`.

The `set_global_description` annotate entire image, while `add_local_description` annotates a part of image.

Parameter: description and detailed_descriptions

Let us introduce a concept called "sub-prompt". If a prompt is less than 75 tokens, and is self-supported to describe a thing without relying on other prompts, we call it a "sub-prompt".

The `description` is a sub-prompt, and the `detailed_descriptions` is a list of sub-prompts.

Note that each sub-prompt is strictly less than 75 tokens (and typically less than 40 tokens), you can safely encode them with any clip without worrying the truncation position affecting the semantics.

The design of sub-prompt also allows more satisfying text encoding based on greedy merge. For example, if you have

sub-prompt A: 25 tokens

sub-prompt B: 35 tokens

sub-prompt C: 5 tokens

sub-prompt D: 60 tokens

sub-prompt E: 15 tokens

sub-prompt F: 25 tokens

and since every sub-prompt is promised to be self-supported to describe a thing independently, we can use greedy method to merge them to bags like

bag 1 {A, B, C} : 65 tokens

bag 2 {D} : 60 tokens

bag 1 {E, F} : 40 tokens

where each bag is less than 75 tokens and can be encoded by any clip in one pass (and then concat them).

Encoding texts in this way will make sure that text-encoder will never make semantic truncation mistakes.

One may ask - if all sub-prompts are less than 75 tokens with independent semantics, why not just encode them without merge and then concat? This is mainly because we want the text embedding to be more coherent. For example, lets say sub-prompt A is "a man" while sub-prompt B is "handsome, professional", then merging them before encoding will give you a more mixed text embedding concept with coherent features of a handsome professional man.

All Omost LLMs are trained to give strictly well-defined sub-prompts. You can make use of these definitions to design lossless text encoding methods.

Parameter: location, offset, area

The three parameters defines a bounding box. Note that they must obey

Parameter: location, offset, area

The three parameters defines a bounding box. Note that they must obey


assert location in [
    "in the center", 
    "on the left", 
    "on the right", 
    "on the top", 
    "on the bottom", 
    "on the top-left", 
    "on the top-right", 
    "on the bottom-left", 
    "on the bottom-right"
]
assert offset in [
    "no offset", 
    "slightly to the left", 
    "slightly to the right", 
    "slightly to the upper", 
    "slightly to the lower", 
    "slightly to the upper-left", 
    "slightly to the upper-right", 
    "slightly to the lower-left", 
    "slightly to the lower-right"
]
assert area in [
    "a small square area", 
    "a small vertical area", 
    "a small horizontal area", 
    "a medium-sized square area", 
    "a medium-sized vertical area", 
    "a medium-sized horizontal area", 
    "a large square area", 
    "a large vertical area", 
    "a large horizontal area"
]

First we divide a canvas into 3*3=9 locations:

image

Then we further divide each location to 3*3 offsets, resulting in 9*9=81 positions:

image

Using these positions as centers, we further define 9 types of bounding boxes:

image

We can see that this method allows 9*9*9=729 different bounding boxes, covering almost all common possible locations of an object in the image.

One may argue that why this is necessary - why not just let the LLMs to learn pixel index or x, y coordinates - and should that be much more accurate? Below is several of my notes:

  1. I have tried several representations, including pixel index like {x = 32, y=16, w=58, h=99}, or margin pixels like {left = 32, right=15, top=27, bottom=33}, or percentage pixel index like {x = 0.124, y=0.65, w=0.335, h=0.251}, or percentage margin like {left = 0.251, right=0.154, top=0.254, bottom=0.441}. The result is that opensource LLMs are really not very good at learning these representations even for Llama3 (perhaps GPT4o can learn it). Sometimes it works sometimes it gives completely random numbers. Note that our problem is very different from MLLM. The vision-LLM usually have image embedding as inputs and in that case estimating numeric position is like a look-up table problem and can somewhat be learned, but our case is where the LLM need to generate every composition from scratch without help of any image embedding to look-up.
  2. But the natural language like "on the right", "slightly to the top-right", "a small vertical area" etc, works very well. The model converges very fast and the learning is stable. It aligns to the pretrained knowledge of LLMs very well.
  3. I have also tried adding some special tokens to represent spatial locations and also train the embedding layers. But that model is very difficult to train and debug. Also, the token-embedding-based method needs many hyperparameter tuning everytime we change the LLM - for example when changing from Llama3 to Phi, if we use the token-embedding method, we need to design training parameters again.
  4. The number 9*9*9=729 is not really a small number from the perspective of bounding box proposals. This can also be called ROI (region of interest) and some old semantic segmentation tech uses (RPN) Region Proposal Network to produce a similar number (<1000) of regions.
  5. Most region-guided diffusion methods are coarse-level methods (like multi-diffusion and attention couple and gligen), and they do not need pixel-perfect regions.
  6. These are very personal results from me - if you are working on some similar multi-modal LLM research, using pixel indices is completely okay, worth trying, and probably other training methods can also achieve a robust system.

Parameter: distance_to_viewer and HTML_web_color_name

The distance_to_viewer can be viewed as relative depth. Note that this value's absolute number is not reliable at all (because opensource LLMs are not very good at producing image-space numbers) and it should only be used in sorting elements into background-to-foreground layers.

You can always use distance_to_viewer to sort all local elements before rendering them using a diffusion model. The global description can be always viewed as the most far away background layer.

The HTML_web_color_name is one of these:


possible_HTML_web_color_names = {  # r, g, b
    'aliceblue': (240, 248, 255), 'antiquewhite': (250, 235, 215), 'aqua': (0, 255, 255),
    'aquamarine': (127, 255, 212), 'azure': (240, 255, 255), 'beige': (245, 245, 220),
    'bisque': (255, 228, 196), 'black': (0, 0, 0), 'blanchedalmond': (255, 235, 205), 'blue': (0, 0, 255),
    'blueviolet': (138, 43, 226), 'brown': (165, 42, 42), 'burlywood': (222, 184, 135),
    'cadetblue': (95, 158, 160), 'chartreuse': (127, 255, 0), 'chocolate': (210, 105, 30),
    'coral': (255, 127, 80), 'cornflowerblue': (100, 149, 237), 'cornsilk': (255, 248, 220),
    'crimson': (220, 20, 60), 'cyan': (0, 255, 255), 'darkblue': (0, 0, 139), 'darkcyan': (0, 139, 139),
    'darkgoldenrod': (184, 134, 11), 'darkgray': (169, 169, 169), 'darkgrey': (169, 169, 169),
    'darkgreen': (0, 100, 0), 'darkkhaki': (189, 183, 107), 'darkmagenta': (139, 0, 139),
    'darkolivegreen': (85, 107, 47), 'darkorange': (255, 140, 0), 'darkorchid': (153, 50, 204),
    'darkred': (139, 0, 0), 'darksalmon': (233, 150, 122), 'darkseagreen': (143, 188, 143),
    'darkslateblue': (72, 61, 139), 'darkslategray': (47, 79, 79), 'darkslategrey': (47, 79, 79),
    'darkturquoise': (0, 206, 209), 'darkviolet': (148, 0, 211), 'deeppink': (255, 20, 147),
    'deepskyblue': (0, 191, 255), 'dimgray': (105, 105, 105), 'dimgrey': (105, 105, 105),
    'dodgerblue': (30, 144, 255), 'firebrick': (178, 34, 34), 'floralwhite': (255, 250, 240),
    'forestgreen': (34, 139, 34), 'fuchsia': (255, 0, 255), 'gainsboro': (220, 220, 220),
    'ghostwhite': (248, 248, 255), 'gold': (255, 215, 0), 'goldenrod': (218, 165, 32),
    'gray': (128, 128, 128), 'grey': (128, 128, 128), 'green': (0, 128, 0), 'greenyellow': (173, 255, 47),
    'honeydew': (240, 255, 240), 'hotpink': (255, 105, 180), 'indianred': (205, 92, 92),
    'indigo': (75, 0, 130), 'ivory': (255, 255, 240), 'khaki': (240, 230, 140), 'lavender': (230, 230, 250),
    'lavenderblush': (255, 240, 245), 'lawngreen': (124, 252, 0), 'lemonchiffon': (255, 250, 205),
    'lightblue': (173, 216, 230), 'lightcoral': (240, 128, 128), 'lightcyan': (224, 255, 255),
    'lightgoldenrodyellow': (250, 250, 210), 'lightgray': (211, 211, 211), 'lightgrey': (211, 211, 211),
    'lightgreen': (144, 238, 144), 'lightpink': (255, 182, 193), 'lightsalmon': (255, 160, 122),
    'lightseagreen': (32, 178, 170), 'lightskyblue': (135, 206, 250), 'lightslategray': (119, 136, 153),
    'lightslategrey': (119, 136, 153), 'lightsteelblue': (176, 196, 222), 'lightyellow': (255, 255, 224),
    'lime': (0, 255, 0), 'limegreen': (50, 205, 50), 'linen': (250, 240, 230), 'magenta': (255, 0, 255),
    'maroon': (128, 0, 0), 'mediumaquamarine': (102, 205, 170), 'mediumblue': (0, 0, 205),
    'mediumorchid': (186, 85, 211), 'mediumpurple': (147, 112, 219), 'mediumseagreen': (60, 179, 113),
    'mediumslateblue': (123, 104, 238), 'mediumspringgreen': (0, 250, 154),
    'mediumturquoise': (72, 209, 204), 'mediumvioletred': (199, 21, 133), 'midnightblue': (25, 25, 112),
    'mintcream': (245, 255, 250), 'mistyrose': (255, 228, 225), 'moccasin': (255, 228, 181),
    'navajowhite': (255, 222, 173), 'navy': (0, 0, 128), 'navyblue': (0, 0, 128),
    'oldlace': (253, 245, 230), 'olive': (128, 128, 0), 'olivedrab': (107, 142, 35),
    'orange': (255, 165, 0), 'orangered': (255, 69, 0), 'orchid': (218, 112, 214),
    'palegoldenrod': (238, 232, 170), 'palegreen': (152, 251, 152), 'paleturquoise': (175, 238, 238),
    'palevioletred': (219, 112, 147), 'papayawhip': (255, 239, 213), 'peachpuff': (255, 218, 185),
    'peru': (205, 133, 63), 'pink': (255, 192, 203), 'plum': (221, 160, 221), 'powderblue': (176, 224, 230),
    'purple': (128, 0, 128), 'rebeccapurple': (102, 51, 153), 'red': (255, 0, 0),
    'rosybrown': (188, 143, 143), 'royalblue': (65, 105, 225), 'saddlebrown': (139, 69, 19),
    'salmon': (250, 128, 114), 'sandybrown': (244, 164, 96), 'seagreen': (46, 139, 87),
    'seashell': (255, 245, 238), 'sienna': (160, 82, 45), 'silver': (192, 192, 192),
    'skyblue': (135, 206, 235), 'slateblue': (106, 90, 205), 'slategray': (112, 128, 144),
    'slategrey': (112, 128, 144), 'snow': (255, 250, 250), 'springgreen': (0, 255, 127),
    'steelblue': (70, 130, 180), 'tan': (210, 180, 140), 'teal': (0, 128, 128), 'thistle': (216, 191, 216),
    'tomato': (255, 99, 71), 'turquoise': (64, 224, 208), 'violet': (238, 130, 238),
    'wheat': (245, 222, 179), 'white': (255, 255, 255), 'whitesmoke': (245, 245, 245),
    'yellow': (255, 255, 0), 'yellowgreen': (154, 205, 50)
}

By combining distance_to_viewer and HTML_web_color_name, you can draw a very coarse image of the composition. For example, if the LLM works well, "a green bottle in front of a red bottle on a wood table in a dark room" should make it possible for you to compute an image like:

image

You can use this image as an initial latent and use denoise strength like 0.95 to 0.99 to generate the image.

Or if you do not like this and still prefer to let diffusion models to generate from zero-mean (even when you know that most diffusion models have tsnr problems), you can ignore this image and or just use this image as a debugger.

Besides, the layer sorting can also be useful in some very special attention formulation - we will discuss this later.

Parameter: tags and atmosphere and style and quality_meta

The tags is designed as a possible replacement for the description since many diffusion models prefer tags. If used with anime models, one may hard code some logics to replace all "girl" to "1girl". If used with Pony then probably always hard code adding "score_9, score_8 ..." to this.

The atmosphere and style and quality_meta are some experimental parameters without very specific use cases. Current we can just treat them as sub-prompts and involve them in the greedy merge of sub-prompt bags. This in my experiments will improve the atmosphere and quality a bit.