In image generation, multi-layer generation is gradually changing how users interact with generative models by letting them isolate, select, and edit individual image layers. Microsoft researchers recently introduced the "Anonymous Region Transformer" (ART), which directly generates variable multi-layer transparent images from a global text prompt and an anonymous region layout.
ART's design is inspired by "schema theory": the generative model decides on its own which visual information aligns with which text information. This contrasts sharply with earlier semantic layouts, which usually require an explicit correspondence between each region and its description. ART's anonymous region layout drops that requirement and is therefore more flexible.
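The difference between the two layout styles can be sketched as data structures. This is a minimal illustration only: the class names `Box`, `SemanticRegion`, and `AnonymousLayout` and the helper `to_anonymous` are hypothetical and are not taken from the ART code. The assumption is that a semantic layout attaches a label to every region, while an anonymous layout carries only the boxes plus one global prompt.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in pixel coordinates (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int


@dataclass(frozen=True)
class SemanticRegion:
    """Semantic layout entry: every box is tied to an explicit label."""
    box: Box
    label: str


@dataclass(frozen=True)
class AnonymousLayout:
    """Anonymous layout: one global prompt and unlabeled boxes.

    Which box shows which part of the prompt is left to the model.
    """
    prompt: str
    boxes: Tuple[Box, ...]


def to_anonymous(prompt: str, regions: Sequence[SemanticRegion]) -> AnonymousLayout:
    """Drop the per-region labels and keep only the geometry."""
    return AnonymousLayout(prompt=prompt, boxes=tuple(r.box for r in regions))


regions = [
    SemanticRegion(Box(0, 0, 256, 128), "headline text"),
    SemanticRegion(Box(64, 160, 192, 288), "coffee cup"),
]
layout = to_anonymous("a poster with a headline and a coffee cup", regions)
```

In this sketch the user only has to draw the boxes and write a single prompt, which is the flexibility the anonymous layout is meant to provide.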
Notably, ART introduces a layer-wise region crop mechanism that selects only the visual information belonging to each anonymous region, which significantly reduces the cost of attention computation. This makes generation more than 12 times faster than the full attention approach, reduces conflicts between layers, and allows ART to generate images with more than 50 distinct layers.
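To see why keeping only the tokens inside each region lowers attention cost, here is a rough back-of-the-envelope sketch. It is not ART's implementation: the function names, the 16-pixel patch size, and the canvas and box sizes are all assumptions chosen for illustration, and the model simply assumes that self-attention cost grows with the square of the token count.

```python
def tokens_in_box(width: int, height: int, patch: int = 16) -> int:
    """Number of patch tokens covering a width x height area (rounded up)."""
    cols = (width + patch - 1) // patch
    rows = (height + patch - 1) // patch
    return cols * rows


def attention_pairs(num_tokens: int) -> int:
    """Token pairs scored by full self-attention: quadratic in token count."""
    return num_tokens * num_tokens


def full_canvas_tokens(canvas_w: int, canvas_h: int, num_layers: int, patch: int = 16) -> int:
    """Every layer is represented at full canvas size."""
    return num_layers * tokens_in_box(canvas_w, canvas_h, patch)


def cropped_tokens(box_sizes, patch: int = 16) -> int:
    """Each layer contributes only the tokens inside its own region."""
    return sum(tokens_in_box(w, h, patch) for w, h in box_sizes)


# Illustrative numbers (assumed, not from the paper):
# a 512x512 canvas with four layers, each occupying a 128x128 region.
boxes = [(128, 128)] * 4
full = full_canvas_tokens(512, 512, num_layers=len(boxes))
cropped = cropped_tokens(boxes)
ratio = attention_pairs(full) / attention_pairs(cropped)
```

With these assumed sizes, the full-canvas version attends over 4096 tokens and the cropped version over 256, so the simplified model predicts 256 times fewer token pairs. This toy figure is not comparable to the 12x speedup reported for ART, which reflects a real end-to-end measurement rather than a count of attention pairs.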
ART also proposes a high-quality multi-layer transparent image autoencoder that directly and jointly encodes and decodes the transparency of a variable number of image layers. This design opens new possibilities for precise control and scalable layer generation, further advancing interactive content creation.
Project: https://art-msra.github.io/