Local 3D Editing via 3D Distillation of CLIP Knowledge

1KAIST AI, 2Kakao Enterprise, 3Scatter Lab

LENeRF

main figure

Abstract

3D content manipulation is an important computer vision task with many real-world applications (e.g., product design, cartoon generation, and 3D avatar editing). Recently proposed 3D GANs can generate diverse, photorealistic 3D-aware content using Neural Radiance Fields (NeRF). However, manipulating NeRF remains challenging because visual quality tends to degrade after manipulation and because existing methods rely on suboptimal control handles such as 2D semantic maps. While text-guided manipulation has shown potential in 3D editing, such approaches often lack locality. To overcome these problems, we propose Local Editing NeRF (LENeRF), which requires only text inputs for fine-grained and localized manipulation. Specifically, we present three add-on modules of LENeRF, the Latent Residual Mapper, the Attention Field Network, and the Deformation Network, which are used jointly to manipulate 3D features locally by estimating a 3D attention field. The 3D attention field is learned in an unsupervised way by distilling the zero-shot mask generation capability of CLIP into 3D space with multi-view guidance. We evaluate the method through diverse experiments, both quantitatively and qualitatively.
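To make the role of a 3D attention field concrete, the following is a minimal sketch of how a soft per-point mask could localize an edit in feature space. It is an illustration under stated assumptions, not the LENeRF implementation: the abstract does not give the blending rule, so the linear interpolation below, the function name `blend_features`, and the random placeholder features are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Map raw attention logits to soft mask values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def blend_features(source_feat, target_feat, attention_logits):
    """Blend per-point features with a soft attention mask (assumed rule).

    source_feat, target_feat: (N, C) features at N sampled 3D points.
    attention_logits: (N,) raw scores, one per 3D point.
    Returns the blended (N, C) features and the (N,) mask.
    """
    mask = sigmoid(attention_logits)[:, None]  # shape (N, 1)
    # Mask near 0 keeps the source feature; mask near 1 takes the edited one.
    blended = (1.0 - mask) * source_feat + mask * target_feat
    return blended, mask[:, 0]


# Placeholder data: 4 sampled points with 8-dimensional features.
rng = np.random.default_rng(0)
source_feat = rng.normal(size=(4, 8))
target_feat = rng.normal(size=(4, 8))
# Two points outside the edit region, one inside, one on the boundary.
attention_logits = np.array([-50.0, -50.0, 50.0, 0.0])

blended, mask = blend_features(source_feat, target_feat, attention_logits)
```

In this sketch, points with strongly negative logits keep their original features, which is what makes the edit local; how the attention values are actually produced and supervised (via CLIP distillation with multi-view guidance) is outside the scope of the example.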

Video

BibTeX


@inproceedings{hyung2023local,
  title     =   {Local 3D Editing via 3D Distillation of CLIP Knowledge},
  author    =   {Hyung, Junha and Hwang, Sungwon and Kim, Daejin and Lee, Hyunji and Choo, Jaegul},
  booktitle =   {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages     =   {12674--12684},
  year      =   {2023}
}