Abstract
In this paper, we propose a semantic-guided framework to address the challenging problem of large-mask image inpainting, where essential visual content is missing and contextual cues are limited. To compensate for the limited context, we leverage a pretrained Amodal Completion (AC) model to generate structure-aware candidates that serve as semantic priors for the missing regions. We introduce the Context-Semantic Fusion Network (CSF-Net), a transformer-based framework that fuses these candidates with contextual features to produce a semantic guidance image for image inpainting. This guidance improves inpainting quality by promoting structural accuracy and semantic consistency. CSF-Net can be seamlessly integrated into existing inpainting models without architectural changes and consistently enhances performance across diverse masking conditions. Extensive experiments on the Places365 and COCOA datasets demonstrate that CSF-Net effectively reduces object hallucination while enhancing visual realism and semantic alignment.
Overall Framework
Overview of the proposed CSF-Net. (a) A pretrained amodal completion model generates multiple object completions, and context-inconsistent candidates are filtered out. (b) Dual Swin-Transformer encoders extract multi-scale features from the masked image and selected candidates, which are fused via a cross-attention fusion decoder. (c) Hierarchical pixel selection is performed using structural and perceptual scores to generate the final semantic guidance image.
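To make the data flow of stage (b) concrete, here is a minimal, runnable PyTorch sketch. Everything in it is an illustrative assumption: a tiny convolutional encoder stands in for the Swin-Transformer backbones, candidate filtering and hierarchical selection are omitted, and names such as `TinyEncoder` and `CrossAttentionFusion` are ours, not the released implementation.

```python
# Minimal sketch of stage (b): dual encoders + cross-attention fusion.
# TinyEncoder is a convolutional stand-in for the Swin-Transformer encoders.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in encoder producing a fine-to-coarse feature pyramid."""
    def __init__(self, in_ch=3, dims=(32, 64, 128)):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for d in dims:
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch, d, 3, stride=2, padding=1), nn.GELU()))
            ch = d

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # resolution halves at each stage

class CrossAttentionFusion(nn.Module):
    """Context features query candidate features (key/value)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ctx, cand):
        b, c, h, w = ctx.shape
        q = ctx.flatten(2).transpose(1, 2)    # (B, H*W, C) context tokens
        kv = cand.flatten(2).transpose(1, 2)  # (B, H*W, C) candidate tokens
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(fused + q)          # residual + norm
        return fused.transpose(1, 2).view(b, c, h, w)

masked_img = torch.randn(1, 3, 64, 64)   # masked input image
candidates = torch.randn(4, 3, 64, 64)   # 4 amodal-completion candidates

ctx_feats = TinyEncoder()(masked_img)        # context feature pyramid
cand_feats = TinyEncoder()(candidates)[-1]   # (4, 128, 8, 8), coarsest scale
fuse = CrossAttentionFusion(dim=128)
fused = [fuse(ctx_feats[-1], cand_feats[i:i + 1]) for i in range(4)]
print(fused[0].shape)  # torch.Size([1, 128, 8, 8])
```

In the full model this fusion is applied at every scale of the pyramid; the sketch fuses only the coarsest scale to keep the example short.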
Key Contributions
- First work to leverage amodal completion for semantic guidance in large-mask inpainting, addressing the challenge of limited contextual cues.
- Transformer-based fusion framework that integrates contextual features and semantic priors through dual encoders and cross-attention fusion decoder.
- Plug-and-play module that integrates seamlessly into existing inpainting models without requiring any architectural modifications (see the integration sketch after this list).
- Consistent performance improvements across diverse masking conditions on Places365 and COCOA datasets.
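As a rough illustration of the plug-and-play claim, the sketch below wraps an arbitrary inpainting backbone: the semantic guidance image produced by CSF-Net is fed through the backbone's existing (image, mask) interface, so no layers are added or changed. Both `csf_net` and `inpainter` are hypothetical placeholder callables, not the released API.

```python
# Hedged sketch of plug-and-play integration. `csf_net` and `inpainter`
# are stand-in callables for the actual pretrained models.
import torch

def inpaint_with_guidance(inpainter, csf_net, image, mask, candidates):
    """image: (B,3,H,W); mask: (B,1,H,W), 1 = missing; candidates: list of (B,3,H,W)."""
    masked = image * (1 - mask)
    guidance = csf_net(masked, mask, candidates)  # semantic guidance image
    # The backbone sees the guidance through its usual (image, mask) input,
    # so its architecture is untouched.
    return inpainter(guidance, mask)

# Toy usage with stub models, just to show the call pattern.
img = torch.randn(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1
stub_csf = lambda m, k, c: m * (1 - k) + c[0] * k   # stand-in for CSF-Net
stub_inpainter = lambda g, k: g                     # stand-in backbone
out = inpaint_with_guidance(stub_inpainter, stub_csf, img, mask,
                            [torch.randn(1, 3, 64, 64)])
print(out.shape)  # torch.Size([1, 3, 64, 64])
```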
Technical Details
Semantic Guidance Image Generation
Overview of semantic guidance image generation in CSF-Net. The guidance image incorporates object-level semantic priors and serves as an input to the inpainting model.
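As a small worked example of how such a guidance image can be assembled, the sketch below keeps the known pixels outside the mask and, inside the mask, copies each pixel from a per-pixel-selected candidate (the selection itself is sketched in the next subsection). Tensor names and shapes are assumptions for illustration.

```python
# Illustrative composition of the semantic guidance image (assumed layout):
# known pixels come from the masked input, hole pixels from the winning
# candidate at each location.
import torch

def compose_guidance(masked_img, mask, candidates, selection):
    """
    masked_img: (B,3,H,W)   image with holes zeroed out
    mask:       (B,1,H,W)   1 = missing pixel
    candidates: (B,N,3,H,W) amodal-completion candidates
    selection:  (B,1,H,W)   index in [0, N) of the chosen candidate per pixel
    """
    idx = selection.long().unsqueeze(2).expand(-1, -1, 3, -1, -1)  # (B,1,3,H,W)
    picked = torch.gather(candidates, 1, idx).squeeze(1)           # (B,3,H,W)
    return masked_img * (1 - mask) + picked * mask

B, N, H, W = 1, 4, 8, 8
guide = compose_guidance(torch.randn(B, 3, H, W),
                         torch.randint(0, 2, (B, 1, H, W)).float(),
                         torch.randn(B, N, 3, H, W),
                         torch.randint(0, N, (B, 1, H, W)))
print(guide.shape)  # torch.Size([1, 3, 8, 8])
```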
Hierarchical Pixel Selection
(a) The Structure Score Network (SSN) and Perceptual Score Network (PSN) compute confidence scores at each scale using fused features and the masked input. Multi-scale consistency is enforced via learnable coefficients β. (b) At the finest scale, the highest-scoring candidate is selected for each pixel to form the semantic guidance image.
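The sketch below mimics the selection logic described in this caption: per-scale scores are combined with learnable coefficients β, upsampled to the finest scale, and a per-pixel argmax over candidates picks the winner. A one-layer `ScoreHead` stands in for both SSN and PSN (the real networks also condition on the masked input); treat every name and shape here as an assumption.

```python
# Hedged sketch of hierarchical pixel selection with stand-in score heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreHead(nn.Module):
    """Stand-in for SSN/PSN: maps fused features to a per-pixel score."""
    def __init__(self, ch):
        super().__init__()
        self.head = nn.Conv2d(ch, 1, 1)

    def forward(self, feat):
        return self.head(feat)

def select_pixels(fused_feats, out_hw):
    """fused_feats: list over scales of (N, C_s, h_s, w_s) fused features,
    with N = number of candidates."""
    # In training, beta and the score heads would be fixed submodules;
    # they are created ad hoc here only to keep the sketch self-contained.
    beta = nn.Parameter(torch.ones(len(fused_feats)))
    total = torch.zeros(1)
    for s, feat in enumerate(fused_feats):
        ssn = ScoreHead(feat.shape[1])(feat)   # structure score  (N,1,h,w)
        psn = ScoreHead(feat.shape[1])(feat)   # perceptual score (N,1,h,w)
        score = F.interpolate(ssn + psn, size=out_hw,
                              mode="bilinear", align_corners=False)
        total = total + beta[s] * score        # multi-scale consistency
    return total.argmax(dim=0)                 # (1,H,W) winning candidate idx

feats = [torch.randn(4, 128, 8, 8), torch.randn(4, 64, 16, 16)]
sel = select_pixels(feats, out_hw=(64, 64))
print(sel.shape)  # torch.Size([1, 64, 64])
```

The resulting index map plays the role of `selection` in the guidance-composition sketch above.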
Results
We proposed CSF-Net, a transformer-based fusion framework that introduces object-level semantic guidance into the image inpainting process. By leveraging a pretrained amodal completion model, CSF-Net generates multiple structure-aware semantic candidates, which are fused with contextual information to produce a semantic guidance image. This guidance enables more accurate and semantically aligned inpainting, particularly in challenging large-mask scenarios where contextual cues are limited. CSF-Net can be seamlessly integrated into various inpainting backbones without any architectural modifications, demonstrating both its generality and practicality for real-world applications. Extensive experiments on the Places365 and COCOA datasets demonstrate that CSF-Net consistently improves performance across multiple inpainting baselines and evaluation metrics.
Qualitative Results
Qualitative comparison of all masking conditions on Places365 and COCOA datasets.
Qualitative results with 80% center box masks on Places365.
Qualitative results with 50% center box masks on Places365.
Qualitative results with 50-80% random masks on Places365.
Quantitative Results
Table 1. Quantitative comparison between state-of-the-art inpainting methods and our CSF-enhanced model (shown with ASUKA integration) at 256 × 256 resolution on Places365. Bold indicates the best performance for each metric.
Table 2. Performance comparison of models with CSF-Net integration on Places365 and COCOA.
BibTeX
@article{heo2025csf,
  title={CSF-Net: Context-Semantic Fusion Network for Large Mask Inpainting},
  author={Heo, Chae-Yeon and Cho, Yeong-Jun},
  journal={arXiv preprint arXiv:2511.07987},
  year={2025}
}