
Computer Vision and Pattern Recognition Academic Digest [1.10]

MrGreen arXiv Daily Academic Digest 2022-05-05



cs.CV: 36 papers today


Detection (6 papers)

【1】 Detecting Twenty-thousand Classes using Image-level Supervision
Link: https://arxiv.org/abs/2201.02605

Authors: Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra
Affiliations: Meta AI; The University of Texas at Austin
Note: Code is available at this https URL
Abstract: Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic reaches 41.7 mAP for all classes and 41.7 mAP for rare classes. For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without fine-tuning. Code is available at https://github.com/facebookresearch/Detic.
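To illustrate the key design choice (no prediction-based assignment of image labels to boxes), here is a minimal sketch of supervising a detector's classification head with an image-level label via a simple prediction-free assignment, the max-size proposal; all shapes and names are illustrative, not the paper's exact code:

```python
import torch
import torch.nn.functional as F

def image_label_loss(proposal_boxes, class_logits, image_labels):
    """Sketch: train a detector's classifier on image-level labels.

    proposal_boxes: (N, 4) xyxy boxes from the proposal stage
    class_logits:   (N, C) classification logits for those proposals
    image_labels:   (C,)   multi-hot float image-level labels

    Instead of matching labels to boxes via model predictions, supervise
    only the largest proposal, which most likely covers the labeled object.
    """
    areas = (proposal_boxes[:, 2] - proposal_boxes[:, 0]) * \
            (proposal_boxes[:, 3] - proposal_boxes[:, 1])
    biggest = areas.argmax()
    return F.binary_cross_entropy_with_logits(class_logits[biggest], image_labels)
```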

【2】 Equalized Focal Loss for Dense Long-Tailed Object Detection
Link: https://arxiv.org/abs/2201.02593

Authors: Bo Li, Yongqiang Yao, Jingru Tan, Gang Zhang, Fengwei Yu, Jianwei Lu, Ye Luo
Affiliations: Tongji University; SenseTime Research; Tsinghua University
Abstract: Despite the recent success of long-tailed object detection, almost all long-tailed object detectors are developed based on the two-stage paradigm. In practice, one-stage detectors are more prevalent in the industry because they have a simple and fast pipeline that is easy to deploy. However, in the long-tailed scenario, this line of work has not been explored so far. In this paper, we investigate whether one-stage detectors can perform well in this case. We discover that the primary obstacle preventing one-stage detectors from achieving excellent performance is that categories suffer from different degrees of positive-negative imbalance under the long-tailed data distribution. The conventional focal loss balances the training process with the same modulating factor for all categories, thus failing to handle the long-tailed problem. To address this issue, we propose the Equalized Focal Loss (EFL), which rebalances the loss contribution of positive and negative samples of different categories independently according to their imbalance degrees. Specifically, EFL adopts a category-relevant modulating factor which can be adjusted dynamically by the training status of different categories. Extensive experiments conducted on the challenging LVIS v1 benchmark demonstrate the effectiveness of our proposed method. With an end-to-end training pipeline, EFL achieves 29.2% in terms of overall AP and obtains significant performance improvements on rare categories, surpassing all existing state-of-the-art methods. The code is available at https://github.com/ModelTC/EOD.
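For intuition, here is a minimal sketch of a sigmoid focal loss whose focusing parameter varies per category; how the per-category factors are derived from training statistics is the paper's contribution and is assumed to happen elsewhere:

```python
import torch
import torch.nn.functional as F

def focal_loss_per_category(logits, targets, gammas, alpha=0.25):
    """Sigmoid focal loss with a category-dependent focusing factor.

    logits:  (N, C) raw classification scores
    targets: (N, C) binary labels
    gammas:  (C,)   per-category focusing parameters; in EFL these are
             adjusted dynamically from each category's positive-negative
             imbalance (that update rule is not shown here)
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # prob. of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gammas.unsqueeze(0) * ce).mean()
```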

【3】 Detecting Human-to-Human-or-Object (H2O) Interactions with DIABOLO
Link: https://arxiv.org/abs/2201.02396

Authors: Astrid Orcesi, Romaric Audigier, Fritz Poka Toukam, Bertrand Luvison
Affiliations: Université Paris-Saclay, CEA, List, Palaiseau, France; Vision Lab, ThereSIS, Thales SIX GTS, Campus Polytechnique, Palaiseau, France
Note: Accepted at the IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021)
Abstract: Detecting human interactions is crucial for human behavior analysis. Many methods have been proposed to deal with Human-to-Object Interaction (HOI) detection, i.e., detecting in an image which person and object interact together and classifying the type of interaction. However, Human-to-Human Interactions, such as social and violent interactions, are generally not considered in available HOI training datasets. As we think these types of interactions cannot be ignored and decorrelated from HOI when analyzing human behavior, we propose a new interaction dataset to deal with both types of human interactions: Human-to-Human-or-Object (H2O). In addition, we introduce a novel taxonomy of verbs, intended to be closer to a description of human body attitude in relation to the surrounding targets of interaction, and more independent of the environment. Unlike some existing datasets, we strive to avoid defining synonymous verbs when their use highly depends on the target type or requires a high level of semantic interpretation. As the H2O dataset includes V-COCO images annotated with this new taxonomy, its images naturally contain more interactions. This can be an issue for HOI detection methods whose complexity depends on the number of people, targets, or interactions. Thus, we propose DIABOLO (Detecting InterActions By Only Looking Once), an efficient subject-centric single-shot method that detects all interactions in one forward pass, with constant inference time independent of image content. In addition, this multi-task network simultaneously detects all people and objects. We show that sharing a network for these tasks not only saves computation resources but also improves performance collaboratively. Finally, DIABOLO is a strong baseline for the newly proposed challenge of H2O interaction detection, as it outperforms all state-of-the-art methods when trained and evaluated on the HOI dataset V-COCO.

【4】 Extending One-Stage Detection with Open-World Proposals
Link: https://arxiv.org/abs/2201.02302

Authors: Sachin Konan, Kevin J Liang, Li Yin
Affiliations: Georgia Institute of Technology; Facebook AI
Abstract: In many applications, such as autonomous driving, hand manipulation, or robot navigation, object detection methods must be able to detect objects unseen in the training set. Open World Detection (OWD) seeks to tackle this problem by generalizing detection performance to seen and unseen class categories. Recent works have seen success in the generation of class-agnostic proposals, which we call Open-World Proposals (OWP), but this comes at the cost of a large drop on the classification task when both tasks are considered in the detection model. These works have investigated two-stage Region Proposal Networks (RPN) by taking advantage of objectness scoring cues; however, for their simplicity, run-time, and decoupling of localization and classification, we investigate OWP through the lens of fully convolutional one-stage detection networks such as FCOS. We show that our architectural and sampling optimizations on FCOS can increase OWP performance by as much as 6% in recall on novel classes, marking the first proposal-free one-stage detection network to achieve performance comparable to RPN-based two-stage networks. Furthermore, we show that the inherent, decoupled architecture of FCOS has benefits for retaining classification performance. While two-stage methods worsen by 6% in recall on novel classes, we show that FCOS only drops 2% when jointly optimizing for OWP and classification.

【5】 RestoreDet: Degradation Equivariant Representation for Object Detection in Low Resolution Images
Link: https://arxiv.org/abs/2201.02314

Authors: Ziteng Cui, Yingying Zhu, Lin Gu, Guo-Jun Qi, Xiaoxiao Li, Peng Gao, Zenghui Zhang, Tatsuya Harada
Affiliations: Shanghai Jiao Tong University; University of Texas at Arlington; RIKEN AIP; The University of Tokyo; Innopeak Technology; The University of British Columbia; Shanghai AI Laboratory
Note: 11 pages, 3 figures
Abstract: Image restoration algorithms such as super resolution (SR) are indispensable pre-processing modules for object detection in degraded images. However, most of these algorithms assume the degradation is fixed and known a priori. When the real degradation is unknown or differs from the assumption, both the pre-processing module and the consequent high-level task such as object detection would fail. Here, we propose a novel framework, RestoreDet, to detect objects in degraded low-resolution images. RestoreDet utilizes the downsampling degradation as a kind of transformation for self-supervised signals to explore the equivariant representation against various resolutions and other degradation conditions. Specifically, we learn this intrinsic visual structure by encoding and decoding the degradation transformation from a pair of original and randomly degraded images. The framework can further take advantage of advanced SR architectures with an arbitrary-resolution restoring decoder to reconstruct the original correspondence from the degraded input image. Both the representation learning and object detection are optimized jointly in an end-to-end training fashion. RestoreDet is a generic framework that can be implemented on any mainstream object detection architecture. Extensive experiments show that our framework based on CenterNet achieves superior performance compared with existing methods when facing various degradation situations. Our code will be released soon.
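A hedged sketch of how the (original, randomly degraded) self-supervised pair might be built; the exact degradation family (downsampling factor plus noise here) is our assumption, not necessarily the paper's:

```python
import torch
import torch.nn.functional as F

def make_degradation_pair(img, scales=(1, 2, 4), noise_sigma=0.02):
    """Build a training pair for degradation-equivariant learning.

    img: (B, C, H, W). A random downsampling factor plus additive noise
    plays the role of the degradation transformation that the framework
    learns to encode and decode.
    """
    s = int(scales[torch.randint(len(scales), (1,))])
    low = F.interpolate(img, scale_factor=1.0 / s, mode="bicubic",
                        align_corners=False)
    low = low + noise_sigma * torch.randn_like(low)
    return img, low, s  # s is a degradation label the decoder can predict
```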

【6】 A Keypoint Detection and Description Network Based on the Vessel Structure for Multi-Modal Retinal Image Registration
Link: https://arxiv.org/abs/2201.02242

Authors: Aline Sindel, Bettina Hohberger, Sebastian Fassihi Dehcordi, Christian Mardin, Robert Lämmer, Andreas Maier, Vincent Christlein
Affiliations: Pattern Recognition Lab, FAU Erlangen-Nürnberg; Department of Ophthalmology, Universitätsklinikum Erlangen
Note: 6 pages, 4 figures, 1 table, accepted to BVM 2022
Abstract: Ophthalmological imaging utilizes different imaging systems, such as color fundus, infrared, fluorescein angiography, optical coherence tomography (OCT) or OCT angiography. Multiple images with different modalities or acquisition times are often analyzed for the diagnosis of retinal diseases. Automatically aligning the vessel structures in the images by means of multi-modal registration can support the ophthalmologists in their work. Our method uses a convolutional neural network to extract features of the vessel structure in multi-modal retinal images. We jointly train a keypoint detection and description network on small patches using a classification and a cross-modal descriptor loss function, and apply the network to the full image size in the test phase. Our method demonstrates the best registration performance on our own and a public multi-modal dataset in comparison to competing methods.

Classification & Recognition (3 papers)

【1】 Negative Evidence Matters in Interpretable Histology Image Classification
Link: https://arxiv.org/abs/2201.02445

Authors: Soufiane Belharbi, Marco Pedersoli, Ismail Ben Ayed, Luke McCaffrey, Eric Granger
Affiliations: Dept. of Systems Engineering, ÉTS Montreal, Canada; Goodman Cancer Research Centre, Dept. of Oncology, McGill University, Montreal, Canada
Note: 10 figures, under review
Abstract: Using only global annotations such as the image class labels, weakly-supervised learning methods allow CNN classifiers to jointly classify an image and yield the regions of interest associated with the predicted class. However, without any guidance at the pixel level, such methods may yield inaccurate regions. This problem is known to be more challenging with histology images than with natural ones, since objects are less salient, structures have more variations, and foreground and background regions have stronger similarities. Therefore, methods in the computer vision literature for visual interpretation of CNNs may not directly apply. In this work, we propose a simple yet efficient method based on a composite loss function that leverages information from the fully negative samples. Our new loss function contains two complementary terms: the first exploits positive evidence collected from the CNN classifier, while the second leverages the fully negative samples from the training dataset. In particular, we equip a pre-trained classifier with a decoder that allows refining the regions of interest. The same classifier is exploited to collect both the positive and negative evidence at the pixel level to train the decoder. This makes it possible to take advantage of the fully negative samples that occur naturally in the data, without any additional supervision signals and using only the image class as supervision. Compared to several recent related methods, over the public benchmark GlaS for colon cancer and a Camelyon16 patch-based benchmark for breast cancer using three different backbones, we show the substantial improvements introduced by our method. Our results show the benefits of using both negative and positive evidence, i.e., the one obtained from a classifier and the one naturally available in datasets. We provide an ablation study of both terms. Our code is publicly available.

【2】 Persistent Homology for Breast Tumor Classification using Mammogram Scans
Link: https://arxiv.org/abs/2201.02295

Authors: Aras Asaad, Dashti Ali, Taban Majeed, Rasber Rashid
Affiliations: School of Computing, The University of Buckingham, UK; Department of Computer Science, Salahaddin University, Kurdistan Region, Iraq
Note: 10 pages
Abstract: An important tool in the field of topological data analysis is persistent homology (PH), which is used to encode abstract representations of the homology of data at different resolutions in the form of a persistence diagram (PD). In this work, we build more than one PD representation of a single image based on a landmark selection method, known as local binary patterns, that encodes different types of local textures from images. We employed different PD vectorizations using persistence landscapes, persistence images, persistence binning (Betti curves) and statistics. We tested the effectiveness of the proposed landmark-based PH on two publicly available breast abnormality detection datasets using mammogram scans. The sensitivity of landmark-based PH is over 90% in both datasets for the detection of abnormal breast scans. Finally, the experimental results give new insights on using different types of PD vectorizations, which helps in utilising PH in conjunction with machine learning classifiers.
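Of the vectorizations mentioned, the Betti curve is simple enough to sketch directly: given a persistence diagram as an array of (birth, death) pairs, it counts how many intervals are alive at each filtration threshold (parameter names below are illustrative):

```python
import numpy as np

def betti_curve(diagram, t_min=0.0, t_max=1.0, n_bins=100):
    """Vectorize a persistence diagram as a Betti curve.

    diagram: (M, 2) array of (birth, death) pairs for one homology dimension.
    Returns an (n_bins,) vector counting intervals alive at each threshold.
    """
    diagram = np.asarray(diagram, dtype=float)
    ts = np.linspace(t_min, t_max, n_bins)
    births, deaths = diagram[:, 0], diagram[:, 1]
    alive = (births[None, :] <= ts[:, None]) & (ts[:, None] < deaths[None, :])
    return alive.sum(axis=1)
```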

【3】 3D Intracranial Aneurysm Classification and Segmentation via Unsupervised Dual-branch Learning
Link: https://arxiv.org/abs/2201.02198

Authors: Di Shao, Xuequan Lu, Xiao Liu
Affiliations: Deakin University, Pigdons Rd, Waurn Ponds, Australia
Note: submitted for review (contact: xuequan.lu@deakin.edu.au)
Abstract: Intracranial aneurysms are common nowadays and how to detect them intelligently is of great significance in digital health. While most existing deep learning research focused on medical images in a supervised way, we introduce an unsupervised method for the detection of intracranial aneurysms based on 3D point cloud data. In particular, our method consists of two stages: unsupervised pre-training and downstream tasks. As for the former, the main idea is to pair each point cloud with its jittered counterpart and maximise their correspondence. Then we design a dual-branch contrastive network with an encoder for each branch and a subsequent common projection head. As for the latter, we design simple networks for supervised classification and segmentation training. Experiments on the public dataset (IntrA) show that our unsupervised method achieves comparable or even better performance than some state-of-the-art supervised techniques, and it is most prominent in the detection of aneurysmal vessels. Experiments on ModelNet40 also show that our method achieves an accuracy of 90.79%, which outperforms existing state-of-the-art unsupervised models.
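The pre-training idea, pairing each point cloud with a jittered copy and maximising their agreement, sits in the family of standard contrastive objectives; below is a hedged sketch using jitter augmentation and an NT-Xent loss (the paper's exact correspondence loss and network details may differ):

```python
import torch
import torch.nn.functional as F

def jitter(points, sigma=0.01, clip=0.05):
    """Perturb each point with small clipped Gaussian noise."""
    noise = torch.clamp(sigma * torch.randn_like(points), -clip, clip)
    return points + noise

def nt_xent(z1, z2, tau=0.1):
    """Contrastive loss between embeddings of a cloud and its jittered copy.

    z1, z2: (B, D) projection-head outputs of the two branches.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)               # (2B, D)
    sim = z @ z.t() / tau                        # cosine similarities
    sim.fill_diagonal_(float("-inf"))            # exclude self-pairs
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)
```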

Segmentation & Semantics (5 papers)

【1】 Leveraging Scale-Invariance and Uncertainity with Self-Supervised Domain Adaptation for Semantic Segmentation of Foggy Scenes
Link: https://arxiv.org/abs/2201.02588

Authors: Javed Iqbal, Rehan Hafiz, Mohsen Ali
Keywords: Foggy Scene Understanding, Semantic Segmentation, Self-supervised Learning, Domain Adaptation
Note: Under Review
Abstract: This paper presents FogAdapt, a novel approach for domain adaptation of semantic segmentation for dense foggy scenes. Although significant research has been directed to reduce the domain shift in semantic segmentation, adaptation to scenes with adverse weather conditions remains an open question. Large variations in the visibility of the scene due to weather conditions, such as fog, smog, and haze, exacerbate the domain shift, thus making unsupervised adaptation in such scenarios challenging. We propose a self-entropy and multi-scale information augmented self-supervised domain adaptation method (FogAdapt) to minimize the domain shift in foggy scene segmentation. Supported by the empirical evidence that an increase in fog density results in high self-entropy for segmentation probabilities, we introduce a self-entropy based loss function to guide the adaptation method. Furthermore, inferences obtained at different image scales are combined and weighted by the uncertainty to generate scale-invariant pseudo-labels for the target domain. These scale-invariant pseudo-labels are robust to visibility and scale variations. We evaluate the proposed model on real clear-weather scenes to real foggy scenes adaptation and synthetic non-foggy images to real foggy scenes adaptation scenarios. Our experiments demonstrate that FogAdapt significantly outperforms the current state-of-the-art in semantic segmentation of foggy images. Specifically, considering the standard settings compared to state-of-the-art (SOTA) methods, FogAdapt gains 3.8% on Foggy Zurich, 6.0% on Foggy Driving-dense, and 3.6% on Foggy Driving in mIoU when adapted from Cityscapes to Foggy Zurich.
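The self-entropy referred to above is the per-pixel Shannon entropy of the softmax output; a minimal sketch of such a loss term (the paper's exact weighting and its combination with the multi-scale term are not shown):

```python
import torch

def self_entropy_loss(probs, eps=1e-8):
    """Mean per-pixel entropy of softmax segmentation probabilities.

    probs: (B, C, H, W) class probabilities (softmax output).
    Higher fog density tends to raise this value, so minimizing it
    sharpens predictions on the foggy target domain.
    """
    ent = -(probs * (probs + eps).log()).sum(dim=1)   # (B, H, W)
    return ent.mean()
```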

【2】 A Novel Incremental Learning Driven Instance Segmentation Framework to Recognize Highly Cluttered Instances of the Contraband Items
Link: https://arxiv.org/abs/2201.02560

Authors: Taimur Hassan, Samet Akcay, Mohammed Bennamoun, Salman Khan, Naoufel Werghi
Affiliations: Khalifa University; Department of Computer Science and Software Engineering, The University of Western Australia
Note: Accepted in IEEE T-SMC: Systems. Source code is available at this https URL
Abstract: Screening cluttered and occluded contraband items from baggage X-ray scans is a cumbersome task even for expert security staff. This paper presents a novel strategy that extends a conventional encoder-decoder architecture to perform instance-aware segmentation and extract merged instances of contraband items without using any additional sub-network or an object detector. The encoder-decoder network first performs conventional semantic segmentation and retrieves cluttered baggage items. The model then incrementally evolves during training to recognize individual instances using significantly reduced training batches. To avoid catastrophic forgetting, a novel objective function minimizes the network loss in each iteration by retaining the previously acquired knowledge while learning new class representations and resolving their complex structural inter-dependencies through Bayesian inference. A thorough evaluation of our framework on two publicly available X-ray datasets shows that it outperforms state-of-the-art methods, especially within challenging cluttered scenarios, while achieving an optimal trade-off between detection accuracy and efficiency.

【3】 CitySurfaces: City-Scale Semantic Segmentation of Sidewalk Materials
Link: https://arxiv.org/abs/2201.02260

Authors: Maryam Hosseini, Fabio Miranda, Jianzhe Lin, Claudio Silva
Affiliations: Department of Computer Science and Engineering, New York University, NY, US; Urban Systems, Rutgers University, NJ, US; Department of Computer Science, University of Illinois at Chicago, IL, US
Note: Sustainable Cities and Society journal (accepted); Model: this https URL
Abstract: While designing sustainable and resilient urban built environments is increasingly promoted around the world, significant data gaps have made research on pressing sustainability issues challenging to carry out. Pavements are known to have strong economic and environmental impacts; however, most cities lack a spatial catalog of their surfaces due to the cost-prohibitive and time-consuming nature of data collection. Recent advancements in computer vision, together with the availability of street-level images, provide new opportunities for cities to extract large-scale built environment data with lower implementation costs and higher accuracy. In this paper, we propose CitySurfaces, an active learning-based framework that leverages computer vision techniques for classifying sidewalk materials using widely available street-level images. We trained the framework on images from New York City and Boston, and the evaluation results show a 90.5% mIoU score. Furthermore, we evaluated the framework using images from six different cities, demonstrating that it can be applied to regions with distinct urban fabrics, even outside the domain of the training data. CitySurfaces can provide researchers and city agencies with a low-cost, accurate, and extensible method to collect sidewalk material data, which plays a critical role in addressing major sustainability issues, including climate change and surface water management.

【4】 Effect of Prior-based Losses on Segmentation Performance: A Benchmark
Link: https://arxiv.org/abs/2201.02428

Authors: Rosana El Jurdi, Caroline Petitjean, Veronika Cheplygina, Paul Honeine, Fahed Abdallah
Affiliations: Normandie Univ, INSA Rouen, UNIROUEN, UNIHAVRE, LITIS, Rouen, France; Computer Science Department, IT University of Copenhagen, Denmark; Medical Image Analysis group, Eindhoven University of Technology, Eindhoven, The Netherlands
Note: To be submitted to SPIE: Journal of Medical Imaging
Abstract: Today, deep convolutional neural networks (CNNs) have demonstrated state-of-the-art performance for medical image segmentation, on various imaging modalities and tasks. Despite early success, segmentation networks may still generate anatomically aberrant segmentations, with holes or inaccuracies near the object boundaries. To enforce anatomical plausibility, recent research studies have focused on incorporating prior knowledge, such as object shape or boundary, as constraints in the loss function. The integrated prior can be low-level, referring to reformulated representations extracted from the ground-truth segmentations, or high-level, representing external medical information such as the organ's shape or size. Over the past few years, prior-based losses have attracted rising interest in the research field, since they allow integration of expert knowledge while still being architecture-agnostic. However, given the diversity of prior-based losses on different medical imaging challenges and tasks, it has become hard to identify what loss works best for which dataset. In this paper, we establish a benchmark of recent prior-based losses for medical image segmentation. The main objective is to provide intuition on which losses to choose given a particular task or dataset. To this end, four low-level and high-level prior-based losses are selected. The considered losses are validated on 8 different datasets from a variety of medical image segmentation challenges, including the Decathlon, ISLES and WMH challenges. Results show that whereas low-level prior-based losses can guarantee an increase in performance over the Dice loss baseline regardless of the dataset characteristics, high-level prior-based losses can increase anatomical plausibility as per data characteristics.
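As a concrete example of a low-level prior-based term (an illustrative choice; the benchmark covers several such losses), a size prior can penalize a predicted soft foreground area that falls outside a plausible range [a, b]:

```python
import torch

def size_prior_loss(probs, a, b):
    """Penalize predicted object sizes outside the prior range [a, b].

    probs: (B, C, H, W) foreground probabilities.
    a, b:  lower/upper bounds on the expected object size in pixels,
           assumed here to come from anatomical prior knowledge.
    """
    size = probs.sum(dim=(2, 3))                 # soft size per image and class
    too_small = torch.relu(a - size) ** 2
    too_big = torch.relu(size - b) ** 2
    return (too_small + too_big).mean()
```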

【5】 Cross-Modality Deep Feature Learning for Brain Tumor Segmentation
Link: https://arxiv.org/abs/2201.02356

Authors: Dingwen Zhang, Guohai Huang, Qiang Zhang, Jungong Han, Junwei Han, Yizhou Yu
Affiliations: School of Mechano-Electronic Engineering, Xidian University; Aberystwyth University; School of Automation, Northwestern Polytechnical University
Note: published in Pattern Recognition 2021
Abstract: Recent advances in machine learning and the prevalence of digital medical images have opened up an opportunity to address the challenging brain tumor segmentation (BTS) task by using deep convolutional neural networks. However, different from the RGB image data that are very widespread, the medical image data used in brain tumor segmentation are relatively scarce in terms of the data scale but contain richer information in terms of the modality property. To this end, this paper proposes a novel cross-modality deep feature learning framework to segment brain tumors from multi-modality MRI data. The core idea is to mine rich patterns across the multi-modality data to make up for the insufficient data scale. The proposed cross-modality deep feature learning framework consists of two learning processes: the cross-modality feature transition (CMFT) process and the cross-modality feature fusion (CMFF) process, which aim at learning rich feature representations by transiting knowledge across different modality data and fusing knowledge from different modality data, respectively. Comprehensive experiments are conducted on the BraTS benchmarks, which show that the proposed cross-modality deep feature learning framework can effectively improve the brain tumor segmentation performance when compared with baseline methods and state-of-the-art methods.

Zero/Few-Shot, Transfer, Domain Adaptation (4 papers)

【1】 Budget-aware Few-shot Learning via Graph Convolutional Network
Link: https://arxiv.org/abs/2201.02304

Authors: Shipeng Yan, Songyang Zhang, Xuming He
Affiliations: ShanghaiTech University
Abstract: This paper tackles the problem of few-shot learning, which aims to learn new visual concepts from a few examples. A common problem setting in few-shot classification assumes a random sampling strategy in acquiring data labels, which is inefficient in practical applications. In this work, we introduce a new budget-aware few-shot learning problem that not only aims to learn novel object categories, but also needs to select informative examples to annotate in order to achieve data efficiency. We develop a meta-learning strategy for our budget-aware few-shot learning task, which jointly learns a novel data selection policy based on a Graph Convolutional Network (GCN) and an example-based few-shot classifier. Our selection policy computes a context-sensitive representation for each unlabeled data point by graph message passing, which is then used to predict an informativeness score for sequential selection. We validate our method by extensive experiments on the mini-ImageNet, tiered-ImageNet and Omniglot datasets. The results show our few-shot learning strategy outperforms baselines by a sizable margin, which demonstrates the efficacy of our method.
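A hedged sketch of the scoring step described above: one round of normalized graph message passing over the unlabeled examples, followed by a per-node informativeness score. The actual policy architecture and its meta-training loop are assumptions here:

```python
import torch

def gcn_informativeness(X, A, W1, w2):
    """Score unlabeled examples via one round of graph message passing.

    X:  (N, D) node features for the unlabeled examples
    A:  (N, N) adjacency matrix of the similarity graph
    W1: (D, H) and w2: (H,) parameters, assumed to be learned by the
        meta-learning loop described in the paper.
    """
    A_hat = A + torch.eye(A.size(0), device=A.device)   # add self-loops
    d = A_hat.sum(dim=1)
    A_norm = A_hat / torch.sqrt(d[:, None] * d[None, :])  # D^-1/2 A D^-1/2
    H = torch.relu(A_norm @ X @ W1)                     # context-sensitive features
    return torch.sigmoid(H @ w2)                        # (N,) informativeness scores
```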

【2】 ITSA: An Information-Theoretic Approach to Automatic Shortcut Avoidance and Domain Generalization in Stereo Matching Networks
Link: https://arxiv.org/abs/2201.02263

Authors: WeiQin Chuah, Ruwan Tennakoon, Reza Hoseinnezhad, Alireza Bab-Hadiashar, David Suter
Affiliations: RMIT University, Australia; Edith Cowan University (ECU), Australia
Note: 11 pages, 4 figures
Abstract: State-of-the-art stereo matching networks trained only on synthetic data often fail to generalize to more challenging real data domains. In this paper, we attempt to unfold an important factor that hinders the networks from generalizing across domains: through the lens of shortcut learning. We demonstrate that the learning of feature representations in stereo matching networks is heavily influenced by synthetic data artefacts (shortcut attributes). To mitigate this issue, we propose an Information-Theoretic Shortcut Avoidance (ITSA) approach to automatically restrict shortcut-related information from being encoded into the feature representations. As a result, our proposed method learns robust and shortcut-invariant features by minimizing the sensitivity of latent features to input variations. To avoid the prohibitive computational cost of direct input sensitivity optimization, we propose an effective yet feasible algorithm to achieve robustness. We show that using this method, state-of-the-art stereo matching networks that are trained purely on synthetic data can effectively generalize to challenging and previously unseen real data scenarios. Importantly, the proposed method enhances the robustness of the synthetic trained networks to the point that they outperform their fine-tuned counterparts (on real data) on challenging out-of-domain stereo datasets.
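The sensitivity-minimization idea can be sketched very simply; the paper derives an information-theoretic objective and an efficient algorithm, so treat this perturbation-based penalty only as the intuition, not the method itself:

```python
import torch

def sensitivity_loss(encoder, x, eps=1e-3):
    """Penalize how much latent features change under small input perturbations.

    encoder: feature extractor; x: (B, C, H, W) input batch.
    A small random perturbation stands in for "input variations".
    """
    x_pert = x + eps * torch.randn_like(x)
    f, f_pert = encoder(x), encoder(x_pert)
    return ((f - f_pert) ** 2).mean()
```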

【3】 Deep Domain Adversarial Adaptation for Photon-efficient Imaging Based on Spatiotemporal Inception Network
Link: https://arxiv.org/abs/2201.02475

Authors: Yiwei Chen, Gongxin Yao, Yong Liu, Yu Pan
Affiliations: College of Control Science and Engineering, Zhejiang University, China
Abstract: In single-photon LiDAR, photon-efficient imaging captures the 3D structure of a scene with only several detected signal photons per pixel. The existing deep learning models for this task are trained on simulated datasets, which poses the domain shift challenge when applied to realistic scenarios. In this paper, we propose a spatiotemporal inception network (STIN) for photon-efficient imaging, which is able to precisely predict the depth from a sparse and high-noise photon counting histogram by fully exploiting spatial and temporal information. Then the domain adversarial adaptation frameworks, including domain-adversarial neural network and adversarial discriminative domain adaptation, are effectively applied to STIN to alleviate the domain shift problem for realistic applications. Comprehensive experiments on the simulated data generated from the NYU v2 and Middlebury datasets demonstrate that STIN outperforms the state-of-the-art models at low signal-to-background ratios from 2:10 to 2:100. Moreover, experimental results on the real-world dataset captured by the single-photon imaging prototype show that STIN with domain adversarial training achieves better generalization performance compared with the state of the art as well as the baseline STIN trained on simulated data.
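Of the two adversarial adaptation frameworks mentioned, the domain-adversarial neural network (DANN) hinges on a gradient reversal layer; a standard PyTorch sketch follows (the STIN backbone and the domain classifier head are outside this snippet):

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradients train the feature extractor to fool the
        # domain classifier, aligning simulated and real feature distributions.
        return -ctx.lam * grad_output, None

def domain_adversarial_logits(features, domain_head: nn.Module, lam=1.0):
    """Pass features through the reversal layer, then the domain classifier."""
    return domain_head(GradReverse.apply(features, lam))
```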

【4】 A three-dimensional dual-domain deep network for high-pitch and sparse helical CT reconstruction
Link: https://arxiv.org/abs/2201.02309

Authors: Wei Wang, Xiang-Gen Xia, Chuanjiang He, Zemin Ren, Jian Lu
Affiliations: Chongqing University
Note: 13 pages, 5 figures
Abstract: In this paper, we propose a new GPU implementation of the Katsevich algorithm for helical CT reconstruction. Our implementation divides the sinograms and reconstructs the CT images pitch by pitch. By utilizing the periodic properties of the parameters of the Katsevich algorithm, our method only needs to calculate these parameters once for all the pitches, and so has a lower GPU-memory burden and is very suitable for deep learning. By embedding our implementation into the network, we propose an end-to-end deep network for high-pitch helical CT reconstruction with sparse detectors. Since our network utilizes the features extracted from both sinograms and CT images, it can simultaneously reduce the streak artifacts caused by the sparsity of sinograms and preserve fine details in the CT images. Experiments show that our network outperforms the related methods in both subjective and objective evaluations.

Semi/Weakly/Unsupervised, Active Learning, Uncertainty (1 paper)

【1】 Uncertainty-Aware Cascaded Dilation Filtering for High-Efficiency Deraining
Link: https://arxiv.org/abs/2201.02366

Authors: Qing Guo, Jingyang Sun, Felix Juefei-Xu, Lei Ma, Di Lin, Wei Feng, Song Wang
Affiliations: University of Alberta; School of Computer Science and Technology, Tianjin University
Note: 14 pages, 10 figures, 10 tables. This is the extension of our conference version this https URL
Abstract: Deraining is a significant and fundamental computer vision task, aiming to remove the rain streaks and accumulations in an image or video captured under a rainy day. Existing deraining methods usually make heuristic assumptions of the rain model, which compels them to employ complex optimization or iterative refinement for high recovery quality. This, however, leads to time-consuming methods and limits their effectiveness on rain patterns that deviate from the assumptions. In this paper, we propose a simple yet efficient deraining method by formulating deraining as a predictive filtering problem without complex rain model assumptions. Specifically, we identify spatially-variant predictive filtering (SPFilt) that adaptively predicts proper kernels via a deep network to filter different individual pixels. Since the filtering can be implemented via well-accelerated convolution, our method can be significantly efficient. We further propose EfDeRain+, which contains three main contributions to address residual rain traces, multi-scale, and diverse rain patterns without harming efficiency. First, we propose uncertainty-aware cascaded predictive filtering (UC-PFilt), which can identify the difficulties of reconstructing clean pixels via predicted kernels and remove the residual rain traces effectively. Second, we design weight-sharing multi-scale dilated filtering (WS-MS-DFilt) to handle multi-scale rain streaks without harming the efficiency. Third, to eliminate the gap across diverse rain patterns, we propose a novel data augmentation method (i.e., RainMix) to train our deep models. By combining all contributions with sophisticated analysis of different variants, our final method outperforms baseline methods on four single-image deraining datasets and one video deraining dataset in terms of both recovery quality and speed.
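The core of the predictive-filtering formulation, applying a distinct predicted kernel at every pixel, can be sketched with an unfold-based implementation; the softmax normalization of the kernels is our assumption, not necessarily the paper's choice:

```python
import torch
import torch.nn.functional as F

def apply_predicted_kernels(img, kernels, k=3):
    """Filter each pixel with its own predicted k x k kernel.

    img:     (B, C, H, W) rainy input
    kernels: (B, k*k, H, W) per-pixel kernels from the prediction network
    """
    B, C, H, W = img.shape
    patches = F.unfold(img, k, padding=k // 2)        # (B, C*k*k, H*W)
    patches = patches.view(B, C, k * k, H, W)
    w = kernels.softmax(dim=1).unsqueeze(1)           # normalize; broadcast over channels
    return (patches * w).sum(dim=2)                   # (B, C, H, W) derained output
```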

Temporal, Action Recognition, Pose, Video, Motion Estimation (3 papers)

【1】 Sign Language Video Retrieval with Free-Form Textual Queries
Link: https://arxiv.org/abs/2201.02495

Authors: Amanda Duarte, Samuel Albanie, Xavier Giró-i-Nieto, Gül Varol
Affiliations: Universitat Politècnica de Catalunya, Spain; Barcelona Supercomputing Center, Spain; Machine Intelligence Laboratory, University of Cambridge, UK; Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Spain
Abstract: Systems that can efficiently search collections of sign language videos have been highlighted as a useful application of sign language technology. However, the problem of searching videos beyond individual keywords has received limited attention in the literature. To address this gap, in this work we introduce the task of sign language retrieval with free-form textual queries: given a written query (e.g., a sentence) and a large collection of sign language videos, the objective is to find the signing video in the collection that best matches the written query. We propose to tackle this task by learning cross-modal embeddings on the recently introduced large-scale How2Sign dataset of American Sign Language (ASL). We identify that a key bottleneck in the performance of the system is the quality of the sign video embedding, which suffers from a scarcity of labeled training data. We therefore propose SPOT-ALIGN, a framework for interleaving iterative rounds of sign spotting and feature alignment to expand the scope and scale of available training data. We validate the effectiveness of SPOT-ALIGN for learning a robust sign video embedding through improvements in both sign recognition and the proposed video retrieval task.

【2】 Video Summarization Based on Video-text Representation
Link: https://arxiv.org/abs/2201.02494

Authors: Haopeng Li, Qiuhong Ke, Mingming Gong, Rui Zhang
Affiliations: University of Melbourne; Tsinghua University
Abstract: Modern video summarization methods are based on deep neural networks, which require a large amount of annotated data for training. However, existing datasets for video summarization are small-scale, easily leading to over-fitting of the deep models. Considering that the annotation of large-scale datasets is time-consuming, we propose a multimodal self-supervised learning framework to obtain semantic representations of videos, which benefits the video summarization task. Specifically, we explore the semantic consistency between the visual information and text information of videos, for the self-supervised pretraining of a multimodal encoder on a newly-collected dataset of video-text pairs. Additionally, we introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries. Finally, an objective evaluation framework is proposed to measure the quality of video summaries based on video classification. Extensive experiments have proved the effectiveness and superiority of our method in rank correlation coefficients, F-score, and the proposed objective evaluation compared to the state of the art.

【3】 Auto-Weighted Layer Representation Based View Synthesis Distortion Estimation for 3-D Video Coding
Link: https://arxiv.org/abs/2201.02420

Authors: Jian Jin, Xingxing Zhang, Lili Meng, Weisi Lin, Jie Liang, Huaxiang Zhang, Yao Zhao
Abstract: Recently, various view synthesis distortion estimation models have been studied to better serve 3-D video coding. However, they can hardly model the relationship quantitatively among different levels of depth changes, texture degeneration, and the view synthesis distortion (VSD), which is crucial for rate-distortion optimization and rate allocation. In this paper, an auto-weighted layer representation based view synthesis distortion estimation model is developed. Firstly, the sub-VSD (S-VSD) is defined according to the level of depth changes and their associated texture degeneration. After that, a set of theoretical derivations demonstrate that the VSD can be approximately decomposed into the S-VSDs multiplied by their associated weights. To obtain the S-VSDs, a layer-based representation of the S-VSD is developed, where all the pixels with the same level of depth changes are represented with a layer to enable efficient S-VSD calculation at the layer level. Meanwhile, a nonlinear mapping function is learnt to accurately represent the relationship between the VSD and the S-VSDs, automatically providing weights for the S-VSDs during VSD estimation. To learn such a function, a dataset of VSD and its associated S-VSDs is built. Experimental results show that the VSD can be accurately estimated with the weights learnt by the nonlinear mapping function once its associated S-VSDs are available. The proposed method outperforms the relevant state-of-the-art methods in both accuracy and efficiency. The dataset and source code of the proposed method will be available at https://github.com/jianjin008/.
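Written out, the decomposition described above is (in our notation, with L levels of depth change and weights w_i supplied by the learned nonlinear mapping):

```latex
\mathrm{VSD} \;\approx\; \sum_{i=1}^{L} w_i \,\mathrm{S\text{-}VSD}_i
```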

Medical (1 paper)

【1】 An Incremental Learning Approach to Automatically Recognize Pulmonary Diseases from the Multi-vendor Chest Radiographs
Link: https://arxiv.org/abs/2201.02574

Authors: Mehreen Sirshar, Taimur Hassan, Muhammad Usman Akram, Shoab Ahmed Khan
Affiliations: Department of Computer and Software Engineering, National University of Sciences and Technology, Islamabad, Pakistan; Center of Cyber-Physical Systems, Department of Electrical Engineering and Computer Sciences, Khalifa University, Abu Dhabi
Abstract: Pulmonary diseases can cause severe respiratory problems, leading to sudden death if not treated timely. Many researchers have utilized deep learning systems to diagnose pulmonary disorders using chest X-rays (CXRs). However, such systems require exhaustive training efforts on large-scale data to effectively diagnose chest abnormalities. Furthermore, procuring such large-scale data is often infeasible and impractical, especially for rare diseases. With the recent advances in incremental learning, researchers have periodically tuned deep neural networks to learn different classification tasks with few training examples. Although such systems can resist catastrophic forgetting, they treat the knowledge representations independently of each other, and this limits their classification performance. Also, to the best of our knowledge, there is no incremental learning-driven image diagnostic framework that is specifically designed to screen pulmonary disorders from CXRs. To address this, we present a novel framework that can learn to screen different chest abnormalities incrementally. In addition to this, the proposed framework is penalized through an incremental learning loss function that infers Bayesian theory to recognize structural and semantic inter-dependencies between incrementally learned knowledge representations to diagnose the pulmonary diseases effectively, regardless of the scanner specifications. We tested the proposed framework on five public CXR datasets containing different chest abnormalities, where it outperformed various state-of-the-art systems across multiple metrics.

GANs, Adversarial, Attacks, Generation (1 paper)

【1】 Deep Generative Framework for Interactive 3D Terrain Authoring and Manipulation
Link: https://arxiv.org/abs/2201.02369

Authors: Shanthika Naik, Aryamaan Jain, Avinash Sharma, KS Rajan
Affiliations: IIIT Hyderabad
Abstract: Automated generation and (user) authoring of realistic virtual terrain is much sought after by multimedia applications like VR models and gaming. The most common representation adopted for terrain is the Digital Elevation Model (DEM). Existing terrain authoring and modeling techniques have addressed some of these needs and can be broadly categorized as procedural modeling, simulation methods, and example-based methods. In this paper, we propose a novel realistic terrain authoring framework powered by a combination of a VAE and a generative conditional GAN model. Our framework is an example-based method that attempts to overcome the limitations of existing methods by learning a latent space from a real-world terrain dataset. This latent space allows us to generate multiple variants of terrain from a single input as well as interpolate between terrains, while keeping the generated terrains close to the real-world data distribution. We also developed an interactive tool that lets the user generate diverse terrains with minimalist inputs. We perform a thorough qualitative and quantitative analysis and provide comparisons with other SOTA methods. We intend to release our code/tool to the academic community.

Faces & Crowd Counting (1 paper)

【1】 A Review of Deep Learning Techniques for Markerless Human Motion on Synthetic Datasets
Link: https://arxiv.org/abs/2201.02503

Authors: Doan Duy Vo, Russell Butler
Affiliations: Department of Computer Science, Bishop's University, Sherbrooke, Quebec, Canada
Note: 11 pages, 5 figures, 2 tables
Abstract: Markerless motion capture has become an active field of research in computer vision in recent years. Its extensive applications are known in a great variety of fields, including computer animation, human motion analysis, biomedical research, virtual reality, and sports science. Estimating human posture has recently gained increasing attention in the computer vision community, but due to depth uncertainty and the lack of synthetic datasets, it is a challenging task. Various approaches have recently been proposed to solve this problem, many of which are based on deep learning. They are primarily focused on improving the performance of existing benchmarks with significant advances, especially on 2D images. Based on powerful deep learning techniques and recently collected real-world datasets, we explored a model that can predict the skeleton of an animation based solely on 2D images. Frames are generated from different real-world datasets, with poses synthesized using different body shapes from simple to complex. The implementation process uses DeepLabCut on its own dataset to perform many necessary steps, then uses the input frames to train the model. The output is an animated skeleton for human movement. The composite dataset and other results serve as the "ground truth" for the deep model.

Tracking (1 paper)

【1】 Learning Target-aware Representation for Visual Tracking via Informative Interactions
Link: https://arxiv.org/abs/2201.02526

Authors: Mingzhe Guo, Zhipeng Zhang, Heng Fan, Liping Jing, Yilin Lyu, Bing Li, Weiming Hu
Affiliations: Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University; NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA); Department of Computer Science and Engineering, University of North Texas, Denton, TX, USA
Note: 9 pages, 6 figures
Abstract: We introduce a novel backbone architecture to improve the target-perception ability of feature representation for tracking. Specifically, having observed that de facto frameworks perform feature matching simply using the outputs from the backbone for target localization, there is no direct feedback from the matching module to the backbone network, especially the shallow layers. More concretely, only the matching module can directly access the target information (in the reference frame), while the representation learning of the candidate frame is blind to the reference target. As a consequence, the accumulation effect of target-irrelevant interference in the shallow stages may degrade the feature quality of deeper layers. In this paper, we approach the problem from a different angle by conducting multiple branch-wise interactions inside the Siamese-like backbone network (InBN). At the core of InBN is a general interaction modeler (GIM) that injects the prior knowledge of the reference image into different stages of the backbone network, leading to better target-perception and robust distractor-resistance of the candidate feature representation with negligible computation cost. The proposed GIM module and InBN mechanism are general and applicable to different backbone types, including CNN and Transformer, for improvements, as evidenced by our extensive experiments on multiple benchmarks. In particular, the CNN version (based on SiamCAR) improves the baseline with 3.2/6.9 absolute gains of SUC on LaSOT/TNL2K, respectively. The Transformer version obtains SUC scores of 65.7/52.0 on LaSOT/TNL2K, which are on par with the recent state of the art. Code and models will be released.

Visual explanation | video understanding / VQA / captioning, etc. (1 paper)

【1】 Repurposing Existing Deep Networks for Caption and Aesthetic-Guided Image Cropping
Link: https://arxiv.org/abs/2201.02280

Authors: Nora Horanyi, Kedi Xia, Kwang Moo Yi, Abhishake Kumar Bojja, Ales Leonardis, Hyung Jin Chang
Affiliations: University of Birmingham, United Kingdom; Zhejiang University, China; University of Victoria, Canada
Comments: None
Abstract: We propose a novel optimization framework that crops a given image based on user description and aesthetics. Unlike existing image cropping methods, which typically train a deep network to regress to crop parameters or cropping actions, we propose to directly optimize the cropping parameters by repurposing networks pre-trained on image captioning and aesthetic tasks, without any fine-tuning, thereby avoiding training a separate network. Specifically, we search for the best crop parameters that minimize a combined loss of the initial objectives of these networks. To make the optimization tractable, we propose three strategies: (i) multi-scale bilinear sampling, (ii) annealing the scale of the crop region, thereby effectively reducing the parameter space, and (iii) aggregation of multiple optimization results. Through various quantitative and qualitative evaluations, we show that our framework can produce crops that are well-aligned with intended user descriptions and aesthetically pleasing.
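A minimal sketch of the direct-optimization idea follows, assuming placeholder losses in place of the frozen captioning and aesthetics networks; the crop is made differentiable with bilinear sampling, and the multi-scale and aggregation strategies are omitted for brevity.

```python
# Sketch: optimize crop parameters through frozen, differentiable objectives.
# caption_loss / aesthetic_loss are placeholders, not the paper's networks.
import torch
import torch.nn.functional as F

def caption_loss(view):    # placeholder for the frozen captioning objective
    return (view.mean() - 0.5) ** 2

def aesthetic_loss(view):  # placeholder for the frozen aesthetics objective
    return view.var()

def crop(image, center, scale):
    """Differentiable crop of `image` (1,C,H,W) via bilinear sampling."""
    zero = torch.zeros(())
    theta = torch.stack([
        torch.stack([scale, zero, center[0]]),
        torch.stack([zero, scale, center[1]]),
    ]).unsqueeze(0)                                   # (1, 2, 3) affine matrix
    grid = F.affine_grid(theta, list(image.shape), align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)

image = torch.rand(1, 3, 224, 224)
center = torch.zeros(2, requires_grad=True)           # crop centre in [-1, 1]
scale = torch.tensor(0.8, requires_grad=True)         # zoom factor
opt = torch.optim.Adam([center, scale], lr=0.05)

for step in range(100):
    view = crop(image, center, scale)
    loss = caption_loss(view) + aesthetic_loss(view)  # combined objectives
    opt.zero_grad()
    loss.backward()
    opt.step()
```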

3D | 3D reconstruction and related (1 paper)

【1】 De-rendering 3D Objects in the Wild
Link: https://arxiv.org/abs/2201.02279

Authors: Felix Wimbauer, Shangzhe Wu, Christian Rupprecht
Affiliations: Visual Geometry Group, University of Oxford
Abstract: With increasing focus on augmented and virtual reality applications (XR) comes the demand for algorithms that can lift objects from images and videos into representations suitable for a wide variety of related 3D tasks. Large-scale deployment of XR devices and applications means that we cannot rely solely on supervised learning, as collecting and annotating data for the unlimited variety of objects in the real world is infeasible. We present a weakly supervised method that is able to decompose a single image of an object into shape (depth and normals), material (albedo, reflectivity and shininess) and global lighting parameters. For training, the method only relies on a rough initial shape estimate of the training objects to bootstrap the learning process. This shape supervision can come, for example, from a pretrained depth network or, more generically, from a traditional structure-from-motion pipeline. In our experiments, we show that the method can successfully de-render 2D images into a decomposed 3D representation and generalizes to unseen object categories. Since in-the-wild evaluation is difficult due to the lack of ground-truth data, we also introduce a photo-realistic synthetic test set that allows for quantitative evaluation.
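The decomposition above implies a differentiable image-formation model. The numpy sketch below shades a surface from albedo, normals, and a single global light with a Phong-style specular term; the exact shading model and parameter values are assumptions for illustration.

```python
# Sketch: re-render an image from decomposed factors (albedo, normals, light).
# The Phong-style model and default parameters are illustrative assumptions.
import numpy as np

def shade(albedo, normals, light_dir, ambient=0.3, reflectivity=0.5, shininess=20.0):
    """albedo: (H,W,3); normals: (H,W,3) unit vectors; light_dir: (3,) unit."""
    n_dot_l = np.clip(normals @ light_dir, 0.0, None)       # (H, W)
    diffuse = albedo * n_dot_l[..., None]
    # Specular: reflect the light about the normal, assume a frontal viewer.
    view = np.array([0.0, 0.0, 1.0])
    refl = 2.0 * n_dot_l[..., None] * normals - light_dir   # (H, W, 3)
    spec = reflectivity * np.clip(refl @ view, 0.0, None) ** shininess
    return np.clip(ambient * albedo + diffuse + spec[..., None], 0.0, 1.0)

# Toy usage: a flat gray patch lit from the upper left.
h, w = 64, 64
albedo = np.full((h, w, 3), 0.6)
normals = np.tile(np.array([0.0, 0.0, 1.0]), (h, w, 1))
light = np.array([-1.0, 1.0, 1.0]) / np.sqrt(3.0)
image = shade(albedo, normals, light)
```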

Other neural networks | deep learning | models | modeling (4 papers)

【1】 Embodied Hands: Modeling and Capturing Hands and Bodies Together
Link: https://arxiv.org/abs/2201.02610

Authors: Javier Romero, Dimitrios Tzionas, Michael J. Black
Affiliations: Max Planck Institute for Intelligent Systems
Comments: None
Abstract: Humans move their hands and bodies together to communicate and solve tasks. Capturing and replicating such coordinated activity is critical for virtual characters that behave realistically. Surprisingly, most methods treat the 3D modeling and tracking of bodies and hands separately. Here we formulate a model of hands and bodies interacting together and fit it to full-body 4D sequences. When scanning or capturing the full body in 3D, hands are small and often partially occluded, making their shape and pose hard to recover. To cope with low resolution, occlusion, and noise, we develop a new model called MANO (hand Model with Articulated and Non-rigid defOrmations). MANO is learned from around 1000 high-resolution 3D scans of the hands of 31 subjects in a wide variety of hand poses. The model is realistic, low-dimensional, captures non-rigid shape changes with pose, is compatible with standard graphics packages, and can fit any human hand. MANO provides a compact mapping from hand poses to pose blend shape corrections and a linear manifold of pose synergies. We attach MANO to a standard parameterized 3D body shape model (SMPL), resulting in a fully articulated body and hand model (SMPL+H). We illustrate SMPL+H by fitting complex, natural activities of subjects captured with a 4D scanner. The fitting is fully automatic and results in full-body models that move naturally with detailed hand motions and a realism not seen before in full-body performance capture. The models and data are freely available for research purposes on our website (http://mano.is.tue.mpg.de).
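The blend-shape formulation that MANO shares with SMPL can be summarized in a few lines. The sketch below is schematic, with random illustrative bases and an assumed pose-feature dimension; MANO itself uses 778 hand vertices and pose features derived from joint rotations.

```python
# Schematic sketch of the MANO/SMPL-style linear blend-shape model: a
# template mesh corrected by shape and pose blend shapes before skinning.
# Bases are random and the pose-feature size is assumed, for illustration.
import numpy as np

n_verts, n_shape, n_pose = 778, 10, 45   # 778 = MANO vertex count; rest assumed

template = np.zeros((n_verts, 3))                        # mean hand mesh
shapedirs = np.random.randn(n_verts, 3, n_shape) * 1e-3  # identity (shape) basis
posedirs = np.random.randn(n_verts, 3, n_pose) * 1e-3    # pose-corrective basis

def blend(beta: np.ndarray, pose_feature: np.ndarray) -> np.ndarray:
    """beta: (n_shape,) identity coefficients; pose_feature: (n_pose,)
    pose-dependent features (derived from joint rotations in MANO/SMPL).
    Returns corrected rest-pose vertices, which linear blend skinning then poses."""
    return template + shapedirs @ beta + posedirs @ pose_feature

verts = blend(np.zeros(n_shape), np.zeros(n_pose))  # equals the template
```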

【2】 Bayesian Neural Networks for Reversible Steganography
Link: https://arxiv.org/abs/2201.02478

Authors: Ching-Chun Chang
Affiliations: University of Warwick
Abstract: Recent advances in deep learning have led to a paradigm shift in reversible steganography. A fundamental pillar of reversible steganography is predictive modelling, which can be realised via deep neural networks. However, non-trivial errors exist in inferences about some out-of-distribution and noisy data. In view of this issue, we propose to consider uncertainty in predictive models based upon a theoretical framework of Bayesian deep learning. Bayesian neural networks can be regarded as self-aware machinery; that is, a machine that knows its own limitations. To quantify uncertainty, we approximate the posterior predictive distribution through Monte Carlo sampling with stochastic forward passes. We further show that predictive uncertainty can be disentangled into aleatoric and epistemic uncertainties, and that these quantities can be learnt in an unsupervised manner. Experimental results demonstrate an improvement delivered by Bayesian uncertainty analysis upon steganographic capacity-distortion performance.
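The Monte Carlo procedure described above is straightforward to sketch: with dropout kept stochastic at test time, repeated forward passes approximate the posterior predictive distribution, whose variance splits into epistemic and aleatoric parts. The tiny heteroscedastic regression head below is an illustrative assumption, not the paper's model.

```python
# Sketch: MC estimate of predictive uncertainty with stochastic forward passes.
# The network predicts a mean and a log-variance per input (assumed head).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 2))
net.train()  # keep dropout active so each forward pass is a posterior sample

x = torch.randn(32, 8)
T = 100
means, variances = [], []
with torch.no_grad():
    for _ in range(T):
        out = net(x)
        means.append(out[:, 0])
        variances.append(out[:, 1].exp())  # heteroscedastic (aleatoric) variance

means = torch.stack(means)          # (T, N)
variances = torch.stack(variances)  # (T, N)

predictive_mean = means.mean(0)
epistemic = means.var(0)            # spread of the means across passes
aleatoric = variances.mean(0)       # average predicted noise variance
total_uncertainty = epistemic + aleatoric
```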

【3】 Motion Prediction via Joint Dependency Modeling in Phase Space
Link: https://arxiv.org/abs/2201.02365

Authors: Pengxiang Su, Zhenguang Liu, Shuang Wu, Lei Zhu, Yifang Yin, Xuanjing Shen
Affiliations: Jilin University, Changchun, Jilin, China; Zhejiang University, Hangzhou, Zhejiang, China; Nanyang Technological University; Shandong Normal University, Jinan, Shandong, China; National University of Singapore
Abstract: Motion prediction is a classic problem in computer vision, which aims at forecasting future motion given the observed pose sequence. Various deep learning models have been proposed, achieving state-of-the-art performance on motion prediction. However, existing methods typically focus on modeling temporal dynamics in the pose space. Unfortunately, the complicated, high-dimensional nature of human motion brings inherent challenges for dynamic context capturing. Therefore, we move away from the conventional pose-based representation and present a novel approach employing a phase space trajectory representation of individual joints. Moreover, current methods tend to only consider the dependencies between physically connected joints. In this paper, we introduce a novel convolutional neural model to effectively leverage explicit prior knowledge of motion anatomy, and simultaneously capture both spatial and temporal information of joint trajectory dynamics. We then propose a global optimization module that learns the implicit relationships between individual joint features. Empirically, our method is evaluated on large-scale 3D human motion benchmark datasets (i.e., Human3.6M, CMU MoCap). These results demonstrate that our method sets a new state of the art on the benchmark datasets. Our code will be available at https://github.com/Pose-Group/TEID.
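A minimal sketch of the phase-space representation: each joint is described by its position together with a finite-difference velocity, rather than by position alone. Array shapes are illustrative.

```python
# Sketch: lift a joint trajectory into phase space (position + velocity).
import numpy as np

def to_phase_space(poses: np.ndarray) -> np.ndarray:
    """poses: (T, J, 3) joint positions over T frames.
    Returns (T-1, J, 6): per-joint position and velocity per frame."""
    velocity = poses[1:] - poses[:-1]             # first-order differences
    return np.concatenate([poses[1:], velocity], axis=-1)

# Toy usage with a random-walk trajectory of 17 joints over 50 frames.
trajectory = np.cumsum(np.random.randn(50, 17, 3) * 0.01, axis=0)
phase = to_phase_space(trajectory)                # (49, 17, 6)
```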

【4】 Multiresolution Fully Convolutional Networks to detect Clouds and Snow through Optical Satellite Images
Link: https://arxiv.org/abs/2201.02350

Authors: Debvrat Varshney, Claudio Persello, Prasun Kumar Gupta, Bhaskar Ramachandra Nikam
Affiliations: University of Maryland Baltimore County, United States of America; University of Twente, Netherlands; Indian Institute of Remote Sensing (IIRS), Dehradun, India
Abstract: Clouds and snow have similar spectral features in the visible and near-infrared (VNIR) range and are thus difficult to distinguish from each other in high-resolution VNIR images. We address this issue by introducing a shortwave-infrared (SWIR) band, where clouds are highly reflective and snow is absorptive. As SWIR is typically of lower resolution than VNIR, this study proposes a multiresolution fully convolutional neural network (FCN) that can effectively detect clouds and snow in VNIR images. We fuse the multiresolution bands within a deep FCN and perform semantic segmentation at the higher, VNIR resolution. Such a fusion-based classifier, trained in an end-to-end manner, achieved 94.31% overall accuracy and an F1 score of 97.67% for clouds on Resourcesat-2 data captured over the state of Uttarakhand, India. These scores were found to be 30% higher than those of a Random Forest classifier, and 10% higher than those of a standalone single-resolution FCN. Apart from being useful for cloud detection purposes, the study also highlights the potential of convolutional neural networks for multi-sensor fusion problems.
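A toy PyTorch sketch of the fusion scheme: the coarse SWIR band is processed in its own branch, upsampled to the VNIR grid, concatenated with the VNIR features, and segmented at the higher resolution. Channel counts and depth are assumptions, not the paper's architecture.

```python
# Sketch: two-branch multiresolution fusion FCN (assumed layer sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionFCN(nn.Module):
    def __init__(self, n_classes: int = 3):  # e.g. cloud / snow / background
        super().__init__()
        self.vnir = nn.Sequential(nn.Conv2d(4, 32, 3, padding=1), nn.ReLU())
        self.swir = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(64, n_classes, 1)

    def forward(self, vnir: torch.Tensor, swir: torch.Tensor) -> torch.Tensor:
        fv = self.vnir(vnir)                          # full VNIR resolution
        fs = self.swir(swir)                          # coarse SWIR resolution
        fs = F.interpolate(fs, size=fv.shape[-2:],    # upsample to VNIR grid
                           mode="bilinear", align_corners=False)
        return self.head(torch.cat([fv, fs], dim=1))  # per-pixel class logits

model = FusionFCN()
logits = model(torch.randn(1, 4, 256, 256), torch.randn(1, 1, 64, 64))
```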

Others (4 papers)

【1】 Generalized Category Discovery
Link: https://arxiv.org/abs/2201.02609

Authors: Sagar Vaze, Kai Han, Andrea Vedaldi, Andrew Zisserman
Affiliations: Visual Geometry Group, University of Oxford; The University of Hong Kong
Comments: 13 pages, 6 figures
Abstract: In this paper, we consider a highly general image recognition setting wherein, given a labelled and an unlabelled set of images, the task is to categorize all images in the unlabelled set. Here, the unlabelled images may come from labelled classes or from novel ones. Existing recognition methods are not able to deal with this setting, because they make several restrictive assumptions, such as the unlabelled instances coming only from known - or unknown - classes, and the number of unknown classes being known a priori. We address the more unconstrained setting, naming it 'Generalized Category Discovery', and challenge all these assumptions. We first establish strong baselines by taking state-of-the-art algorithms from novel category discovery and adapting them for this task. Next, we propose the use of vision transformers with contrastive representation learning for this open-world setting. We then introduce a simple yet effective semi-supervised $k$-means method to cluster the unlabelled data into seen and unseen classes automatically, substantially outperforming the baselines. Finally, we also propose a new approach to estimate the number of classes in the unlabelled data. We thoroughly evaluate our approach on public datasets for generic object classification, including CIFAR10, CIFAR100 and ImageNet-100, and for fine-grained visual recognition, including CUB, Stanford Cars and Herbarium19, benchmarking on this new setting to foster future research.
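The semi-supervised $k$-means idea can be sketched compactly: labelled points keep their ground-truth assignment and only unlabelled points are re-assigned each iteration, so centroids for seen classes stay anchored while extra centroids absorb novel classes. Sizes below are toy values, not the paper's setup.

```python
# Sketch: semi-supervised k-means with fixed assignments for labelled data.
import numpy as np

def ss_kmeans(x_lab, y_lab, x_unlab, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    x_all = np.vstack([x_lab, x_unlab])
    centroids = x_all[rng.choice(len(x_all), k, replace=False)]
    for _ in range(iters):
        d = ((x_unlab[:, None] - centroids[None]) ** 2).sum(-1)
        y_unlab = d.argmin(1)                      # re-assign unlabelled only
        assign = np.concatenate([y_lab, y_unlab])  # labelled stay fixed
        for c in range(k):
            pts = x_all[assign == c]
            if len(pts):
                centroids[c] = pts.mean(0)
    return centroids, y_unlab

x_lab = np.random.randn(100, 16)
y_lab = np.random.randint(0, 5, 100)     # 5 seen classes
x_unlab = np.random.randn(200, 16)
centroids, preds = ss_kmeans(x_lab, y_lab, x_unlab, k=8)  # 8 = seen + novel
```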

【2】 NeROIC: Neural Rendering of Objects from Online Image Collections
Link: https://arxiv.org/abs/2201.02533

Authors: Zhengfei Kuang, Kyle Olszewski, Menglei Chai, Zeng Huang, Panos Achlioptas, Sergey Tulyakov
Affiliations: University of Southern California; Snap Inc.
Comments: Project page including code can be found at: this https URL
Abstract: We present a novel method to acquire object representations from online image collections, capturing high-quality geometry and material properties of arbitrary objects from photographs with varying cameras, illumination, and backgrounds. This enables various object-centric rendering applications, such as novel-view synthesis, relighting, and harmonized background composition, from challenging in-the-wild input. Using a multi-stage approach extending neural radiance fields, we first infer the surface geometry and refine the coarsely estimated initial camera parameters, while leveraging coarse foreground object masks to improve the training efficiency and geometry quality. We also introduce a robust normal estimation technique which eliminates the effect of geometric noise while retaining crucial details. Lastly, we extract surface material properties and ambient illumination, represented in spherical harmonics with extensions that handle transient elements, e.g. sharp shadows. The union of these components results in a highly modular and efficient object acquisition framework. Extensive evaluations and comparisons demonstrate the advantages of our approach in capturing high-quality geometry and appearance properties useful for rendering applications.
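The spherical-harmonics lighting mentioned above can be evaluated per pixel from surface normals. The sketch below uses the standard real SH basis up to band 2; the nine coefficients are arbitrary example values, not a recovered illumination.

```python
# Sketch: evaluate second-order spherical-harmonics lighting at normals.
# Coefficients are arbitrary example values for illustration.
import numpy as np

def sh_basis(n: np.ndarray) -> np.ndarray:
    """n: (..., 3) unit normals -> (..., 9) real SH basis values (bands 0-2)."""
    x, y, z = n[..., 0], n[..., 1], n[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),                 # Y_00
        0.488603 * y, 0.488603 * z, 0.488603 * x,   # band 1
        1.092548 * x * y, 1.092548 * y * z,         # band 2
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ], axis=-1)

coeffs = np.array([0.8, 0.1, 0.3, 0.05, 0.0, 0.02, 0.1, 0.0, 0.05])  # example
normals = np.random.randn(480, 640, 3)
normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
shading = sh_basis(normals) @ coeffs       # per-pixel irradiance, (480, 640)
```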

【3】 Consistent Style Transfer
Link: https://arxiv.org/abs/2201.02233

Authors: Xuan Luo, Zhen Han, Lingkang Yang, Lingling Zhang
Affiliations: Xi'an Jiaotong University; Wuhan University
Comments: 10 pages, 11 figures
Abstract: Recently, attentional arbitrary style transfer methods have been proposed to achieve fine-grained results; they manipulate the point-wise similarity between content and style features for stylization. However, the attention mechanism based on feature points ignores the feature multi-manifold distribution, where each feature manifold corresponds to a semantic region in the image. Consequently, a uniform content semantic region is rendered by highly different patterns from various style semantic regions, producing inconsistent stylization results with visual artifacts. We propose the progressive attentional manifold alignment (PAMA) to alleviate this problem, which repeatedly applies attention operations and space-aware interpolations. The attention operation rearranges style features dynamically according to the spatial distribution of content features. This makes the content and style manifolds correspond on the feature map. Then the space-aware interpolation adaptively interpolates between the corresponding content and style manifolds to increase their similarity. By gradually aligning the content manifolds to style manifolds, the proposed PAMA achieves state-of-the-art performance while avoiding the inconsistency of semantic regions. Code is available at https://github.com/computer-vision2022/PAMA.
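The attention operation at the heart of this line of work can be sketched as follows: style features are softly rearranged to the spatial layout of the content according to feature similarity. This is the generic mechanism only, not the full PAMA module with its space-aware interpolation.

```python
# Sketch: attention-based rearrangement of style features to the content's
# spatial layout. A simplified stand-in for the attention step described above.
import torch
import torch.nn.functional as F

def rearrange_style(content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
    """content, style: (B, C, H, W). Returns style features re-ordered to the
    spatial layout of the content."""
    b, c, h, w = content.shape
    q = F.normalize(content.flatten(2), dim=1).transpose(1, 2)  # (B, HWc, C)
    k = F.normalize(style.flatten(2), dim=1)                    # (B, C, HWs)
    attn = torch.softmax(q @ k, dim=-1)                         # (B, HWc, HWs)
    v = style.flatten(2).transpose(1, 2)                        # (B, HWs, C)
    out = attn @ v                                              # (B, HWc, C)
    return out.transpose(1, 2).reshape(b, c, h, w)

content = torch.randn(1, 64, 32, 32)
style = torch.randn(1, 64, 32, 32)
aligned = rearrange_style(content, style)  # style patterns in content's layout
```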

【4】 Amplitude SAR Imagery Splicing Localization
Link: https://arxiv.org/abs/2201.02409

Authors: Edoardo Daniele Cannas, Nicolò Bonettini, Sara Mandelli, Paolo Bestagini, Stefano Tubaro
Abstract: Synthetic Aperture Radar (SAR) images are a valuable asset for a wide variety of tasks. In the last few years, many websites have been offering them for free in the form of easy-to-manage products, favoring their widespread diffusion and research work in the SAR field. The drawback of these opportunities is that such images might be exposed to forgeries and manipulations by malicious users, raising new concerns about their integrity and trustworthiness. Up to now, the multimedia forensics literature has proposed various techniques to localize manipulations in natural photographs, but the integrity assessment of SAR images was never investigated. This task poses new challenges, since SAR images are generated with a processing chain completely different from that of natural photographs. This implies that many forensics methods developed for natural images are not guaranteed to succeed. In this paper, we investigate the problem of amplitude SAR imagery splicing localization. Our goal is to localize regions of an amplitude SAR image that have been copied and pasted from another image, possibly undergoing some kind of editing in the process. To do so, we leverage a Convolutional Neural Network (CNN) to extract a fingerprint highlighting inconsistencies in the processing traces of the analyzed input. Then, we examine this fingerprint to produce a binary tampering mask indicating the pixel region under splicing attack. Results show that our proposed method, tailored to the nature of SAR signals, provides better performance than state-of-the-art forensic tools developed for natural images.
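A schematic sketch of the two-stage idea: a CNN maps the amplitude image to a per-pixel fingerprint, and a simple analysis of that fingerprint yields a binary tampering mask. The untrained network and the threshold rule here are illustrative assumptions, not the paper's trained extractor or decision stage.

```python
# Sketch: fingerprint extraction followed by a naive tampering-mask rule.
# The architecture and the outlier threshold are illustrative assumptions.
import torch
import torch.nn as nn

fingerprint_net = nn.Sequential(          # stand-in for a trained extractor
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 4, 3, padding=1),
)

sar = torch.rand(1, 1, 256, 256)          # single-channel amplitude image
fingerprint = fingerprint_net(sar)        # (1, 4, 256, 256)

# Flag pixels whose fingerprint deviates from the image-wide statistics:
# spliced regions should carry inconsistent processing traces.
mu = fingerprint.mean(dim=(2, 3), keepdim=True)
dist = (fingerprint - mu).pow(2).sum(dim=1).sqrt()  # (1, 256, 256)
mask = dist > dist.mean() + 2.0 * dist.std()        # binary tampering mask
```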
