VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining

This directory contains the model and inference code for the CVPR 2023 paper: "VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining" by Junjie Ke, Keren Ye, Jiahui Yu, Yonghui Wu, Peyman Milanfar, and Feng Yang.

TFHub

The VILA-R model is available on TensorFlow Hub for predicting image aesthetic scores. See tfhub_inference.ipynb for a sample notebook that runs the model.

To explore the code and implementation in more depth, follow the instructions below.

Prerequisite

Install dependencies (works with Python 3.10):

pip3 install -r requirements.txt

The model checkpoints can be downloaded from the gcloud directory link.

The folder contains the following checkpoints:

./vila/checkpoints/vila_pretrain/: VILA-P, pretrained on the AVA-Captions dataset.
./vila/checkpoints/vila_rank_tuned/: VILA-R, finetuned on the AVA MOS prediction task using the proposed rank-based adapter module.
./vila/checkpoints/laion_pretrain/: LAION-pretrained CoCa model.
./vila/spm_model/: The SentencePiece tokenizer used by the models.
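
Before running anything, it can help to confirm that the download produced the layout listed above. The following is a minimal sketch, not part of the VILA code: the helper name missing_checkpoint_dirs is hypothetical, and /tmp as the download root is an assumption taken from the example commands in this README.

```python
from pathlib import Path

# Relative directories taken from the checkpoint list in this README.
EXPECTED_DIRS = (
    "vila/checkpoints/vila_pretrain",
    "vila/checkpoints/vila_rank_tuned",
    "vila/checkpoints/laion_pretrain",
    "vila/spm_model",
)


def missing_checkpoint_dirs(root):
    """Return the expected directories that do not exist under `root`.

    Hypothetical helper for illustration; it only checks that the
    directories exist, not that the checkpoint files inside are valid.
    """
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Assumption: the folder was downloaded to /tmp, as in the example commands.
    missing = missing_checkpoint_dirs("/tmp")
    if missing:
        print(f"Missing directories: {missing}")
    else:
        print("All expected directories found.")
```

An empty result only means the directory names match; it does not verify checkpoint contents.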

Run Inference

Example command for running the VILA-R model for aesthetic assessment:

python3 -m vila.run_vila_predict \
  --ckpt_dir=/tmp/vila/checkpoints/vila_rank_tuned/ \
  --image_path=/tmp/image.jpg \
  --spm_model_path=/tmp/vila/spm_model/spm.model

Example command for running the VILA model for captioning:

python3 -m vila.run_vila_decode \
  --ckpt_dir=/tmp/vila/checkpoints/vila_rank_tuned/ \
  --image_path=/tmp/image.jpg \
  --spm_model_path=/tmp/vila/spm_model/spm.model

Example command for running the LAION-pretrained model for captioning:

python3 -m vila.run_vila_decode \
  --is_pretrain \
  --ckpt_dir=/tmp/vila/checkpoints/laion_pretrain/ \
  --image_path=/tmp/image.jpg \
  --spm_model_path=/tmp/vila/spm_model/spm.model
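
To run these commands over several images from a script, one option is to build the argument list in Python and hand it to subprocess. This is a sketch under stated assumptions: build_vila_command is a hypothetical helper, not part of the repository, and it uses only the module names and flags that appear in the commands above.

```python
import sys


def build_vila_command(module, ckpt_dir, image_path, spm_model_path,
                       is_pretrain=False):
    """Build the argv list for one of the README's example commands.

    Hypothetical helper: `module` is "vila.run_vila_predict" or
    "vila.run_vila_decode"; the flags mirror the README commands.
    """
    cmd = [sys.executable, "-m", module]
    if is_pretrain:
        # Only the LAION-pretrained captioning example passes this flag.
        cmd.append("--is_pretrain")
    cmd += [
        f"--ckpt_dir={ckpt_dir}",
        f"--image_path={image_path}",
        f"--spm_model_path={spm_model_path}",
    ]
    return cmd


if __name__ == "__main__":
    # Paths follow the /tmp layout used in the example commands.
    for image in ["/tmp/image.jpg"]:
        cmd = build_vila_command(
            "vila.run_vila_predict",
            ckpt_dir="/tmp/vila/checkpoints/vila_rank_tuned/",
            image_path=image,
            spm_model_path="/tmp/vila/spm_model/spm.model",
        )
        # Print instead of executing; see the note below for running it.
        print(" ".join(cmd))
```

To actually execute a command, pass the list to subprocess.run(cmd, check=True) from the directory where the vila package is importable. Note that each invocation starts a new process, so the model is presumably reloaded for every image.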

Citation

If you find this code useful for your publication, please cite the original paper:

@inproceedings{ke2023vila,
  title = {VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining},
  author={Ke, Junjie and Ye, Keren and Yu, Jiahui and Wu, Yonghui and Milanfar, Peyman and Yang, Feng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={10041--10051},
  year={2023}
}
