Abstract
In this paper, we propose a semantic-guided framework to address the challenging problem of large-mask image inpainting, where essential visual content is missing and contextual cues are limited. To compensate for the limited context, we leverage a pretrained Amodal Completion (AC) model to generate structure-aware candidates that serve as semantic priors for the missing regions. We introduce the Context-Semantic Fusion Network (CSF-Net), a transformer-based framework that fuses these candidates with contextual features to produce a semantic guidance image for image inpainting. This guidance improves inpainting quality by promoting structural accuracy and semantic consistency. CSF-Net can be seamlessly integrated into existing inpainting models without architectural changes and consistently enhances performance across diverse masking conditions. Extensive experiments on the Places365 and COCOA datasets demonstrate that CSF-Net effectively reduces object hallucination while enhancing visual realism and semantic alignment.
Overall Framework
Overview of the proposed CSF-Net. (a) A pretrained amodal completion model generates multiple object completions, and context-inconsistent candidates are filtered out. (b) Dual Swin-Transformer encoders extract multi-scale features from the masked image and selected candidates, which are fused via a cross-attention fusion decoder. (c) Hierarchical pixel selection is performed using structural and perceptual scores to generate the final semantic guidance image.
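To make the data flow of stage (b) concrete, here is a minimal, runnable PyTorch sketch. Everything in it is an illustrative assumption: a tiny convolutional encoder stands in for the Swin-Transformer backbones, candidate filtering and hierarchical selection are omitted, and names such as `TinyEncoder` and `CrossAttentionFusion` are ours, not the released implementation.

```python
# Minimal sketch of stage (b): dual encoders + cross-attention fusion.
# TinyEncoder is a convolutional stand-in for the Swin-Transformer encoders.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in encoder producing a fine-to-coarse feature pyramid."""
    def __init__(self, in_ch=3, dims=(32, 64, 128)):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for d in dims:
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch, d, 3, stride=2, padding=1), nn.GELU()))
            ch = d

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # resolution halves at each stage

class CrossAttentionFusion(nn.Module):
    """Context features query candidate features (key/value)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ctx, cand):
        b, c, h, w = ctx.shape
        q = ctx.flatten(2).transpose(1, 2)    # (B, H*W, C) context tokens
        kv = cand.flatten(2).transpose(1, 2)  # (B, H*W, C) candidate tokens
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(fused + q)          # residual + norm
        return fused.transpose(1, 2).view(b, c, h, w)

masked_img = torch.randn(1, 3, 64, 64)   # masked input image
candidates = torch.randn(4, 3, 64, 64)   # 4 amodal-completion candidates

ctx_feats = TinyEncoder()(masked_img)        # context feature pyramid
cand_feats = TinyEncoder()(candidates)[-1]   # (4, 128, 8, 8), coarsest scale
fuse = CrossAttentionFusion(dim=128)
fused = [fuse(ctx_feats[-1], cand_feats[i:i + 1]) for i in range(4)]
print(fused[0].shape)  # torch.Size([1, 128, 8, 8])
```

In the full model this fusion is applied at every scale of the pyramid; the sketch fuses only the coarsest scale to keep the example short.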
Key Contributions
- First work to leverage amodal completion for semantic guidance in large-mask inpainting, addressing the challenge of limited contextual cues.
- Transformer-based fusion framework that integrates contextual features and semantic priors through dual encoders and cross-attention fusion decoder.
- Plug-and-play module that integrates seamlessly into existing inpainting models without requiring any architectural modifications (see the integration sketch after this list).
- Consistent performance improvements across diverse masking conditions on Places365 and COCOA datasets.
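As a rough illustration of the plug-and-play claim, the sketch below wraps an arbitrary inpainting backbone: the semantic guidance image produced by CSF-Net is fed through the backbone's existing (image, mask) interface, so no layers are added or changed. Both `csf_net` and `inpainter` are hypothetical placeholder callables, not the released API.

```python
# Hedged sketch of plug-and-play integration. `csf_net` and `inpainter`
# are stand-in callables for the actual pretrained models.
import torch

def inpaint_with_guidance(inpainter, csf_net, image, mask, candidates):
    """image: (B,3,H,W); mask: (B,1,H,W), 1 = missing; candidates: list of (B,3,H,W)."""
    masked = image * (1 - mask)
    guidance = csf_net(masked, mask, candidates)  # semantic guidance image
    # The backbone sees the guidance through its usual (image, mask) input,
    # so its architecture is untouched.
    return inpainter(guidance, mask)

# Toy usage with stub models, just to show the call pattern.
img = torch.randn(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1
stub_csf = lambda m, k, c: m * (1 - k) + c[0] * k   # stand-in for CSF-Net
stub_inpainter = lambda g, k: g                     # stand-in backbone
out = inpaint_with_guidance(stub_inpainter, stub_csf, img, mask,
                            [torch.randn(1, 3, 64, 64)])
print(out.shape)  # torch.Size([1, 3, 64, 64])
```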
Technical Details
Semantic Guidance Image Generation
Overview of semantic guidance image generation in CSF-Net. The guidance image incorporates object-level semantic priors and serves as an input to the inpainting model.
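As a small worked example of how such a guidance image can be assembled, the sketch below keeps the known pixels outside the mask and, inside the mask, copies each pixel from a per-pixel-selected candidate (the selection itself is sketched in the next subsection). Tensor names and shapes are assumptions for illustration.

```python
# Illustrative composition of the semantic guidance image (assumed layout):
# known pixels come from the masked input, hole pixels from the winning
# candidate at each location.
import torch

def compose_guidance(masked_img, mask, candidates, selection):
    """
    masked_img: (B,3,H,W)   image with holes zeroed out
    mask:       (B,1,H,W)   1 = missing pixel
    candidates: (B,N,3,H,W) amodal-completion candidates
    selection:  (B,1,H,W)   index in [0, N) of the chosen candidate per pixel
    """
    idx = selection.long().unsqueeze(2).expand(-1, -1, 3, -1, -1)  # (B,1,3,H,W)
    picked = torch.gather(candidates, 1, idx).squeeze(1)           # (B,3,H,W)
    return masked_img * (1 - mask) + picked * mask

B, N, H, W = 1, 4, 8, 8
guide = compose_guidance(torch.randn(B, 3, H, W),
                         torch.randint(0, 2, (B, 1, H, W)).float(),
                         torch.randn(B, N, 3, H, W),
                         torch.randint(0, N, (B, 1, H, W)))
print(guide.shape)  # torch.Size([1, 3, 8, 8])
```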
Hierarchical Pixel Selection
(a) The Structure Score Network (SSN) and Perceptual Score Network (PSN) compute confidence scores at each scale using fused features and the masked input. Multi-scale consistency is enforced via learnable coefficients β. (b) At the finest scale, the highest-scoring candidate is selected for each pixel to form the semantic guidance image.
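The sketch below mimics the selection logic described in this caption: per-scale scores are combined with learnable coefficients β, upsampled to the finest scale, and a per-pixel argmax over candidates picks the winner. A one-layer `ScoreHead` stands in for both SSN and PSN (the real networks also condition on the masked input); treat every name and shape here as an assumption.

```python
# Hedged sketch of hierarchical pixel selection with stand-in score heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreHead(nn.Module):
    """Stand-in for SSN/PSN: maps fused features to a per-pixel score."""
    def __init__(self, ch):
        super().__init__()
        self.head = nn.Conv2d(ch, 1, 1)

    def forward(self, feat):
        return self.head(feat)

def select_pixels(fused_feats, out_hw):
    """fused_feats: list over scales of (N, C_s, h_s, w_s) fused features,
    with N = number of candidates."""
    # In training, beta and the score heads would be fixed submodules;
    # they are created ad hoc here only to keep the sketch self-contained.
    beta = nn.Parameter(torch.ones(len(fused_feats)))
    total = torch.zeros(1)
    for s, feat in enumerate(fused_feats):
        ssn = ScoreHead(feat.shape[1])(feat)   # structure score  (N,1,h,w)
        psn = ScoreHead(feat.shape[1])(feat)   # perceptual score (N,1,h,w)
        score = F.interpolate(ssn + psn, size=out_hw,
                              mode="bilinear", align_corners=False)
        total = total + beta[s] * score        # multi-scale consistency
    return total.argmax(dim=0)                 # (1,H,W) winning candidate idx

feats = [torch.randn(4, 128, 8, 8), torch.randn(4, 64, 16, 16)]
sel = select_pixels(feats, out_hw=(64, 64))
print(sel.shape)  # torch.Size([1, 64, 64])
```

The resulting index map plays the role of `selection` in the guidance-composition sketch above.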
Results
We proposed CSF-Net, a transformer-based fusion framework that introduces object-level semantic guidance into the image inpainting process. By leveraging a pretrained amodal completion model, CSF-Net generates multiple structure-aware semantic candidates, which are fused with contextual information to produce a semantic guidance image. This guidance enables more accurate and semantically aligned inpainting, particularly in challenging large-mask scenarios where contextual cues are limited. CSF-Net can be seamlessly integrated into various inpainting backbones without any architectural modifications, demonstrating both its generality and practicality for real-world applications. Extensive experiments on the Places365 and COCOA datasets demonstrate that CSF-Net consistently improves performance across multiple inpainting baselines and evaluation metrics.
Qualitative Results
Qualitative comparison of all masking conditions on Places365 and COCOA datasets.
Qualitative results with 80% center box masks on Places365.
Qualitative results with 50% center box masks on Places365.
Qualitative results with 50-80% random masks on Places365.
Quantitative Results
Table 1. Quantitative comparison between state-of-the-art inpainting methods and our CSF-enhanced model (shown with ASUKA integration) at 256 × 256 resolution on Places365. Bold indicates the best performance for each metric.
Table 2. Performance comparison of models with CSF-Net integration on Places365 and COCOA.
BibTeX
@article{heo2025csf,
  title={CSF-Net: Context-Semantic Fusion Network for Large Mask Inpainting},
  author={Heo, Chae-Yeon and Cho, Yeong-Jun},
  journal={arXiv preprint arXiv:2511.07987},
  year={2025}
}