
Statistics Academic Digest [1.10]

Mr. Green (MrGreen), arXiv Daily Academic Digest, 2022-05-05



stat (Statistics): 20 papers in total


【1】 AugmentedPCA: A Python Package of Supervised and Adversarial Linear Factor Models
Link: https://arxiv.org/abs/2201.02547

Authors: William E. Carson IV, Austin Talbot, David Carlson
Affiliations: Department of Biomedical Engineering, Duke University, Durham, NC; Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA; Department of Civil and Environmental Engineering; Department of Biostatistics and Bioinformatics
Note: NeurIPS 2021 (Learning Meaningful Representations of Life Workshop)
Abstract: Deep autoencoders are often extended with a supervised or adversarial loss to learn latent representations with desirable properties, such as greater predictivity of labels and outcomes or fairness with respect to a sensitive variable. Despite the ubiquity of supervised and adversarial deep latent factor models, these methods should demonstrate improvement over simpler linear approaches to be preferred in practice. This necessitates a reproducible linear analog that still adheres to an augmenting supervised or adversarial objective. We address this methodological gap by presenting methods that augment the principal component analysis (PCA) objective with either a supervised or an adversarial objective and provide analytic and reproducible solutions. We implement these methods in an open-source Python package, AugmentedPCA, that can produce excellent real-world baselines. We demonstrate the utility of these factor models on an open-source, RNA-seq cancer gene expression dataset, showing that augmenting with a supervised objective results in improved downstream classification performance, produces principal components with greater class fidelity, and facilitates identification of genes aligned with the principal axes of data variance with implications to development of specific types of cancer.
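The augmented-PCA idea can be illustrated with a toy linear sketch (not the AugmentedPCA package's actual algorithm): add a label-covariance term, weighted by a hypothetical tuning parameter `mu`, to the covariance matrix before the eigendecomposition, so the leading directions are pulled toward label-predictive directions.

```python
import numpy as np

def supervised_pca(X, y, k=2, mu=1.0):
    """Toy supervised-PCA sketch (not the package's algorithm): augment the
    covariance with a label-covariance term so the leading directions are
    pulled toward label-predictive directions; mu=0 recovers plain PCA."""
    X = X - X.mean(axis=0)
    y = (y - y.mean()).reshape(-1, 1)
    C = X.T @ X + mu * (X.T @ y) @ (y.T @ X)
    vals, vecs = np.linalg.eigh(C)          # eigenvalues in ascending order
    W = vecs[:, ::-1][:, :k]                # top-k augmented directions
    return X @ W, W                         # factor scores and loadings

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] + 0.1 * rng.normal(size=100)    # outcome driven by feature 0
scores, W = supervised_pca(X, y, k=2, mu=10.0)
print(np.abs(W[:, 0]).argmax())  # the first component aligns with feature 0
```

With `mu=0` this reduces to ordinary PCA; here the supervised term pulls the first component toward the feature that drives the label.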

【2】 A Unified Statistical Learning Model for Rankings and Scores with Application to Grant Panel Review
Link: https://arxiv.org/abs/2201.02539

Authors: Michael Pearce, Elena A. Erosheva
Affiliations: University of Washington, Seattle, WA, USA; Department of Statistics, School of Social Work, and the Center for Statistics and the Social Sciences
Abstract: Rankings and scores are two common data types used by judges to express preferences and/or perceptions of quality in a collection of objects. Numerous models exist to study data of each type separately, but no unified statistical model captures both data types simultaneously without first performing data conversion. We propose the Mallows-Binomial model to close this gap, which combines a Mallows' $\phi$ ranking model with Binomial score models through shared parameters that quantify object quality, a consensus ranking, and the level of consensus between judges. We propose an efficient tree-search algorithm to calculate the exact MLE of model parameters, study statistical properties of the model both analytically and through simulation, and apply our model to real data from an instance of grant panel review that collected both scores and partial rankings. Furthermore, we demonstrate how model outputs can be used to rank objects with confidence. The proposed model is shown to sensibly combine information from both scores and rankings to quantify object quality and measure consensus with appropriate levels of statistical uncertainty.
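The Mallows $\phi$ component of the proposed model can be sketched by brute force for a handful of items: each ranking is weighted by exp(-theta × Kendall distance) to a consensus ranking. The Binomial score component and the shared-parameter coupling of the paper are omitted in this sketch.

```python
import numpy as np
from itertools import permutations

def kendall_tau(r1, r2):
    """Number of discordantly ordered item pairs between two rank vectors."""
    n = len(r1)
    return sum((r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
               for i in range(n) for j in range(i + 1, n))

def mallows_pmf(theta, r0):
    """Exact Mallows phi pmf over all rankings of len(r0) items, by brute
    force: P(r) is proportional to exp(-theta * d_Kendall(r, r0))."""
    rankings = list(permutations(range(len(r0))))
    w = np.array([np.exp(-theta * kendall_tau(r, r0)) for r in rankings])
    return dict(zip(rankings, w / w.sum()))

pmf = mallows_pmf(theta=1.0, r0=(0, 1, 2, 3))
print(max(pmf, key=pmf.get))  # the consensus ranking (0, 1, 2, 3) is the mode
```

Enumeration is only feasible for small item sets, which is exactly why the paper develops a tree-search algorithm for the exact MLE.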

【3】 Spatial data modeling by means of Gibbs Markov random fields based on a generalized planar rotator model
Link: https://arxiv.org/abs/2201.02537

Authors: Milan Žukovič, Dionissios T. Hristopulos
Affiliations: School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece
Note: 29 pages, 9 figures
Abstract: We introduce a Gibbs Markov random field for spatial data on Cartesian grids which is based on the generalized planar rotator (GPR) model. The GPR model generalizes the recently proposed modified planar rotator (MPR) model by including in the Hamiltonian additional terms that better capture realistic features of spatial data, such as smoothness, non-Gaussianity, and geometric anisotropy. In particular, the GPR model includes up to an infinite number of higher-order harmonics with exponentially vanishing interaction strength, directional dependence of the bilinear interaction term between nearest grid neighbors, longer-distance neighbor interactions, and two types of external bias fields. Hence, in contrast with the single-parameter MPR model, the GPR model features five additional parameters: the number $n$ of higher-order terms and the parameter $\alpha$ controlling their decay rate, the exchange anisotropy parameter $J^{nn}$, the further-neighbor interaction coupling $J^{fn}$, and the external field (bias) parameters $K$ (or $K'$). We present numerical tests on various synthetic data which demonstrate the effects of the respective terms on the model's prediction performance, and we discuss these results in connection with the data properties.
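A minimal sketch of a Hamiltonian in this spirit, assuming a periodic grid and using exp(-alpha·(q-1)) as the decaying weight of the q-th harmonic; the paper's exact parameterization, further-neighbor couplings, and bias fields are not reproduced here.

```python
import numpy as np

def gpr_energy(theta, n_harm=3, alpha=1.0, Jx=1.0, Jy=1.0):
    """Toy GPR-style energy on a periodic grid: nearest-neighbour cosine
    couplings with n_harm harmonics whose strengths decay like
    exp(-alpha*(q-1)); Jx != Jy mimics a direction-dependent coupling.
    Illustrative only: further-neighbour terms and bias fields omitted."""
    dx = theta - np.roll(theta, 1, axis=1)   # horizontal neighbour angle gaps
    dy = theta - np.roll(theta, 1, axis=0)   # vertical neighbour angle gaps
    energy = 0.0
    for q in range(1, n_harm + 1):
        w = np.exp(-alpha * (q - 1))
        energy -= w * (Jx * np.cos(q * dx).sum() + Jy * np.cos(q * dy).sum())
    return energy

rng = np.random.default_rng(1)
rough = rng.uniform(0.0, 2.0 * np.pi, size=(8, 8))
smooth = np.zeros((8, 8))                     # perfectly aligned field
print(gpr_energy(smooth), gpr_energy(rough))  # aligned data has lower energy
```

Lower energy for smoother fields is what makes such a Gibbs random field favor spatially coherent configurations.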

【4】 Power and Sample Size Calculations for Rerandomized Experiments
Link: https://arxiv.org/abs/2201.02486

Authors: Zach Branson, Xinran Li, Peng Ding
Affiliations: Department of Statistics, Carnegie Mellon University, Pittsburgh, PA; Department of Statistics, University of Illinois, Champaign, IL; Department of Statistics, University of California, Berkeley, CA
Note: 20 pages, 4 figures
Abstract: Power is an important aspect of experimental design, because it allows researchers to understand the chance of detecting causal effects if they exist. It is common to specify a desired level of power, and then compute the sample size necessary to obtain that level of power; thus, power calculations help determine how experiments are conducted in practice. Power and sample size calculations are readily available for completely randomized experiments; however, there can be many benefits to using other experimental designs. For example, in recent years it has been established that rerandomized designs, where subjects are randomized until a prespecified level of covariate balance is obtained, increase the precision of causal effect estimators. This work establishes the statistical power of rerandomized treatment-control experiments, thereby allowing for sample size calculators. Our theoretical results also clarify how power and sample size are affected by treatment effect heterogeneity, a quantity that is often ignored in power analyses. Via simulation, we confirm our theoretical results and find that rerandomization can lead to substantial sample size reductions; e.g., in many realistic scenarios, rerandomization can lead to a 25% or even 50% reduction in sample size for a fixed level of power, compared to complete randomization. Power and sample size calculators based on our results are in the R package rerandPower on CRAN.
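For reference, the complete-randomization baseline that the paper's rerandomization results improve upon follows the standard two-sample z-test formula, n = 2((z_{1-α/2} + z_{1-β})σ/δ)² per arm; a minimal sketch (the paper's own calculators are in the R package rerandPower):

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_arm(delta, sigma, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sample z-test under complete
    randomization (normal approximation, equal allocation)."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2)

n = sample_size_two_arm(delta=0.5, sigma=1.0)
print(n)  # 63 per arm
```

Per the abstract, rerandomization can cut such an n by roughly 25-50% at the same power in many realistic scenarios.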

【5】 Optimality in Noisy Importance Sampling
Link: https://arxiv.org/abs/2201.02432

Authors: Fernando Llorente, Luca Martino, Jesse Read, David Delgado-Gómez
Affiliations: Universidad Carlos III de Madrid, Leganés, Spain; Universidad Rey Juan Carlos, Fuenlabrada, Spain; École Polytechnique, Palaiseau, France
Abstract: In this work, we analyze noisy importance sampling (IS), i.e., IS working with noisy evaluations of the target density. We present the general framework and derive optimal proposal densities for noisy IS estimators. The optimal proposals incorporate the information of the variance of the noisy realizations, proposing points in regions where the noise power is higher. We also compare the use of the optimal proposals with previous optimality approaches considered in a noisy IS framework.
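The basic noisy-IS setting can be sketched as self-normalized importance sampling in which every unnormalized target evaluation carries multiplicative positive noise; the paper's optimal-proposal construction is not reproduced here.

```python
import numpy as np

def noisy_is_mean(n=200_000, noise_sd=0.5, seed=2):
    """Self-normalized IS estimate of E[X] for a standard-normal target,
    where each unnormalized density evaluation is hit by multiplicative
    log-normal noise (independent of x) -- the noisy-IS setting."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 2.0, size=n)               # proposal N(0, 2^2)
    log_q = -0.5 * (x / 2.0) ** 2                  # proposal log-density (up to const.)
    log_pi = -0.5 * x ** 2                         # unnormalized target
    noise = rng.lognormal(0.0, noise_sd, size=n)   # positive, i.i.d. noise
    w = np.exp(log_pi - log_q) * noise             # noisy importance weights
    return np.sum(w * x) / np.sum(w)               # self-normalized estimate

est = noisy_is_mean()
print(est)  # near the true mean 0
```

Because the noise here is independent of x, its common scale cancels in the self-normalized ratio; the paper's contribution is choosing the proposal when the noise variance varies over the space.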

【6】 Measurement Error Models for Spatial Network Lattice Data: Analysis of Car Crashes in Leeds
Link: https://arxiv.org/abs/2201.02394

Authors: Andrea Gilardi, Riccardo Borgoni, Luca Presicce, Jorge Mateu
Abstract: Road casualties represent an alarming concern for modern societies, especially in poor and developing countries. In the last years, several authors developed sophisticated statistical approaches to help local authorities implement new policies and mitigate the problem. These models are typically developed taking into account a set of socio-economic or demographic variables, such as population density and traffic volumes. However, they usually ignore that the external factors may be suffering from measurement errors, which can severely bias the statistical inference. This paper presents a Bayesian hierarchical model to analyse car crash occurrences at the network lattice level taking into account measurement error in the spatial covariates. The suggested methodology is exemplified considering all road collisions in the road network of Leeds (UK) from 2011 to 2019. Traffic volumes are approximated at the street segment level using an extensive set of road counts obtained from mobile devices, and the estimates are corrected using a measurement error model. Our results show that omitting measurement error considerably worsens the model's fit and attenuates the effects of imprecise covariates.
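The attenuation the authors guard against is the classical errors-in-variables bias; a toy simulation, assuming the measurement-error variance is known (here 1.0), shows the naive slope shrinking toward zero and the corrected slope recovering the truth. This is only an illustration of the phenomenon, not the paper's Bayesian hierarchical model.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50_000
traffic = rng.normal(0.0, 1.0, n)             # true (unobserved) covariate
beta = 0.8
y = beta * traffic + rng.normal(0.0, 1.0, n)  # toy crash-risk response
noisy = traffic + rng.normal(0.0, 1.0, n)     # mismeasured traffic counts

naive = np.cov(noisy, y)[0, 1] / np.var(noisy)
# classical attenuation: naive slope ~ beta * var(x) / (var(x) + var(err))
corrected = naive * np.var(noisy) / (np.var(noisy) - 1.0)
print(naive, corrected)  # roughly 0.4 and 0.8
```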

【7】 Bayesian Online Change Point Detection for Baseline Shifts
Link: https://arxiv.org/abs/2201.02325

Authors: Ginga Yoshizawa
Affiliations: Intel K.K., Tokyo, Japan
Abstract: In time series data analysis, detecting change points on a real-time basis (online) is of great interest in many areas, such as finance, environmental monitoring, and medicine. One promising means to achieve this is the Bayesian online change point detection (BOCPD) algorithm, which has been successfully adopted in particular cases in which the time series of interest has a fixed baseline. However, we have found that the algorithm struggles when the baseline irreversibly shifts from its initial state. This is because with the original BOCPD algorithm, the sensitivity with which a change point can be detected is degraded if the data points are fluctuating at locations relatively far from the original baseline. In this paper, we not only extend the original BOCPD algorithm to be applicable to a time series whose baseline is constantly shifting toward unknown values but also visualize why the proposed extension works. To demonstrate the efficacy of the proposed algorithm compared to the original one, we examine these algorithms on two real-world data sets and six synthetic data sets.
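A minimal BOCPD recursion in the style of Adams and MacKay, with a known-variance Gaussian likelihood and constant hazard, illustrates the run-length posterior the paper builds on; the baseline-shift extension that is the paper's contribution is not included.

```python
import numpy as np

def bocpd(data, hazard=0.02, mu0=0.0, kappa0=1.0, sigma=1.0):
    """Minimal BOCPD: known-variance Gaussian likelihood, conjugate Normal
    prior on the mean, constant hazard. Returns the MAP run length after
    each observation."""
    T = len(data)
    R = np.zeros((T + 1, T + 1))
    R[0, 0] = 1.0
    mu, kappa = np.array([mu0]), np.array([kappa0])
    map_run = []
    for t, x in enumerate(data):
        var = sigma ** 2 * (1.0 + 1.0 / kappa)        # predictive variance
        pred = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        R[t + 1, 1:t + 2] = R[t, :t + 1] * pred * (1 - hazard)  # run grows
        R[t + 1, 0] = (R[t, :t + 1] * pred).sum() * hazard      # change point
        R[t + 1] /= R[t + 1].sum()
        mu = np.concatenate(([mu0], (kappa * mu + x) / (kappa + 1)))
        kappa = np.concatenate(([kappa0], kappa + 1))
        map_run.append(int(R[t + 1].argmax()))
    return np.array(map_run)

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
run_len = bocpd(data)
print(run_len[95], run_len[190])  # run length resets after the shift at t=100
```

In the paper's problematic regime the post-change data keep drifting away from the prior mean `mu0`, which is exactly where this vanilla recursion loses sensitivity.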

【8】 New designs for Bayesian adaptive cluster-randomized trials
Link: https://arxiv.org/abs/2201.02301

Authors: Junwei Shen, Shirin Golchi, Erica E. M. Moodie, David Benrimoh
Affiliations: Department of Epidemiology, Biostatistics and Occupational Health, McGill University; Aifred Health, Montreal, Canada; Department of Psychiatry, McGill University
Abstract: Adaptive approaches, allowing for more flexible trial design, have been proposed for individually randomized trials to save time or reduce sample size. However, adaptive designs for cluster-randomized trials in which groups of participants rather than individuals are randomized to treatment arms are less common. Motivated by a cluster-randomized trial designed to assess the effectiveness of a machine-learning based clinical decision support system for physicians treating patients with depression, two Bayesian adaptive designs for cluster-randomized trials are proposed to allow for early stopping for efficacy at pre-planned interim analyses. The difference between the two designs lies in the way that participants are sequentially recruited. Given a maximum number of clusters as well as maximum cluster size allowed in the trial, one design sequentially recruits clusters with the given maximum cluster size, while the other recruits all clusters at the beginning of the trial but sequentially enrolls individual participants until the trial is stopped early for efficacy or the final analysis has been reached. The design operating characteristics are explored via simulations for a variety of scenarios and two outcome types for the two designs. The simulation results show that for different outcomes the design choice may be different. We make recommendations for designs of Bayesian adaptive cluster-randomized trials based on the simulation results.
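The interim efficacy check common to both designs can be sketched with binary outcomes and independent Beta priors, stopping early when the posterior probability of superiority crosses a threshold; the 0.99 boundary and the flat priors below are arbitrary illustrations, not the paper's calibrated design, and clustering is ignored.

```python
import numpy as np

def prob_superiority(x_t, n_t, x_c, n_c, a=1.0, b=1.0, draws=100_000, seed=4):
    """Monte Carlo posterior Pr(p_treatment > p_control) under independent
    Beta(a, b) priors on binary response rates -- the kind of quantity
    monitored at each interim analysis."""
    rng = np.random.default_rng(seed)
    pt = rng.beta(a + x_t, b + n_t - x_t, size=draws)
    pc = rng.beta(a + x_c, b + n_c - x_c, size=draws)
    return float((pt > pc).mean())

# illustrative interim rule: stop for efficacy once Pr(superiority) > 0.99
prob = prob_superiority(x_t=40, n_t=50, x_c=25, n_c=50)
print(prob, prob > 0.99)
```

A real cluster-randomized analysis would additionally model within-cluster correlation, which is where the two recruitment schemes in the paper differ.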

【9】 A Theoretical Framework of Almost Hyperparameter-free Hyperparameter Selection Methods for Offline Policy Evaluation
Link: https://arxiv.org/abs/2201.02300

Authors: Kohei Miyaguchi
Affiliations: IBM Research - Tokyo
Note: AAAI22-AI4DO (workshop)
Abstract: We are concerned with the problem of hyperparameter selection of offline policy evaluation (OPE). OPE is a key component of offline reinforcement learning, which is a core technology for data-driven decision optimization without environment simulators. However, the current state-of-the-art OPE methods are not hyperparameter-free, which undermines their utility in real-life applications. We address this issue by introducing a new approximate hyperparameter selection (AHS) framework for OPE, which defines a notion of optimality (called selection criteria) in a quantitative and interpretable manner without hyperparameters. We then derive four AHS methods each of which has different characteristics such as convergence rate and time complexity. Finally, we verify the effectiveness and limitations of these methods with a preliminary experiment.

【10】 GCWSNet: Generalized Consistent Weighted Sampling for Scalable and Accurate Training of Neural Networks
Link: https://arxiv.org/abs/2201.02283

Authors: Ping Li, Weijie Zhao
Affiliations: Cognitive Computing Lab, Baidu Research, Bellevue, WA, USA
Abstract: We develop the "generalized consistent weighted sampling" (GCWS) for hashing the "powered-GMM" (pGMM) kernel (with a tuning parameter $p$). It turns out that GCWS provides a numerically stable scheme for applying power transformation on the original data, regardless of the magnitude of $p$ and the data. The power transformation is often effective for boosting the performance, in many cases considerably so. We feed the hashed data to neural networks on a variety of public classification datasets and name our method "GCWSNet". Our extensive experiments show that GCWSNet often improves the classification accuracy. Furthermore, it is evident from the experiments that GCWSNet converges substantially faster. In fact, GCWS often reaches a reasonable accuracy with merely (less than) one epoch of the training process. This property is highly desirable because many applications, such as advertisement click-through rate (CTR) prediction models, or data streams (i.e., data seen only once), often train just one epoch. Another beneficial side effect is that the computations of the first layer of the neural networks become additions instead of multiplications because the input data become binary (and highly sparse). Empirical comparisons with (normalized) random Fourier features (NRFF) are provided. We also propose to reduce the model size of GCWSNet by count-sketch and develop the theory for analyzing the impact of using count-sketch on the accuracy of GCWS. Our analysis shows that an "8-bit" strategy should work well in that we can always apply an 8-bit count-sketch hashing on the output of GCWS hashing without hurting the accuracy much. There are many other ways to take advantage of GCWS when training deep neural networks. For example, one can apply GCWS on the outputs of the last layer to boost the accuracy of trained deep neural networks.
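The hashing primitive underlying GCWS can be sketched with Ioffe-style consistent weighted sampling. Identical weighted vectors collide on every hash, and the collision frequency between different vectors approximates their weighted (sum-min over sum-max) similarity; the pGMM power transformation and the neural-network stage are omitted, and this index-only variant is a simplification.

```python
import numpy as np

def cws_hash(w, n_hashes=64, seed=5):
    """Ioffe-style consistent weighted sampling of a non-negative vector w:
    each hash returns the index of the sampled coordinate. The same seed
    must be shared across vectors so they see the same hash functions."""
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(w > 0)
    out = []
    for _ in range(n_hashes):
        r = rng.gamma(2.0, 1.0, size=len(w))
        c = rng.gamma(2.0, 1.0, size=len(w))
        beta = rng.uniform(0.0, 1.0, size=len(w))
        t = np.floor(np.log(w[idx]) / r[idx] + beta[idx])
        y = np.exp(r[idx] * (t - beta[idx]))
        a = c[idx] / (y * np.exp(r[idx]))
        out.append(int(idx[np.argmin(a)]))
    return np.array(out)

w1 = np.array([3.0, 1.0, 0.0, 2.0])
w2 = np.array([3.0, 1.0, 0.0, 0.0])
# collision rate approximates sum-min / sum-max = 4/6 for these two vectors
print((cws_hash(w1) == cws_hash(w2)).mean())
```

Feeding such one-hot hash indices to a network is what makes the first layer additive and sparse, as described in the abstract.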

【11】 Predictive Criteria for Prior Selection Using Shrinkage in Linear Models
Link: https://arxiv.org/abs/2201.02244

Authors: Dean Dustin, Bertrand Clarke, Jennifer Clarke
Affiliations: Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE
Abstract: Choosing a shrinkage method can be done by selecting a penalty from a list of pre-specified penalties or by constructing a penalty based on the data. If a list of penalties for a class of linear models is given, we provide comparisons based on sample size and number of non-zero parameters under a predictive stability criterion based on data perturbation. These comparisons provide recommendations for penalty selection in a variety of settings. If the preference is to construct a penalty customized for a given problem, then we propose a technique based on genetic algorithms, again using a predictive criterion. We find that, in general, a custom penalty never performs worse than any commonly used penalties but that there are cases where the custom penalty reduces to a recognizable penalty. Since penalty selection is mathematically equivalent to prior selection, our method also constructs priors. The techniques and recommendations we offer are intended for finite sample cases. In this context, we argue that predictive stability under perturbation is one of the few relevant properties that can be invoked when the true model is not known. Nevertheless, we study variable inclusion in simulations and, as part of our shrinkage selection strategy, we include oracle property considerations. In particular, we see that the oracle property typically holds for penalties that satisfy basic regularity conditions and therefore is not restrictive enough to play a direct role in penalty selection. In addition, our real data example also includes considerations emerging from model mis-specification.
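The flavor of a perturbation-based predictive stability criterion can be illustrated for ridge penalties: refit under jittered responses and measure how much the fitted predictions move. The perturbation scheme and score below are illustrative assumptions, not the authors' exact criterion.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge coefficients."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def predictive_instability(X, y, lam, n_perturb=50, noise=0.1, seed=10):
    """Average squared movement of fitted predictions when the response is
    jittered -- a perturbation-based stability score (smaller = stabler)."""
    rng = np.random.default_rng(seed)
    base = X @ ridge(X, y, lam)
    moves = [np.mean((X @ ridge(X, y + noise * rng.normal(size=len(y)), lam)
                      - base) ** 2) for _ in range(n_perturb)]
    return float(np.mean(moves))

rng = np.random.default_rng(11)
X = rng.normal(size=(100, 10))
y = X[:, 0] + 0.5 * rng.normal(size=100)
print(predictive_instability(X, y, lam=0.1),
      predictive_instability(X, y, lam=100.0))  # heavier shrinkage is stabler
```

Stability alone favors maximal shrinkage, which is why a full criterion must trade it off against predictive accuracy.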

【12】 Stationary GE-Process and its Application in Analyzing Gold Price Data
Link: https://arxiv.org/abs/2201.02568

Authors: Debasis Kundu
Affiliations: Department of Mathematics and Statistics, Indian Institute of Technology Kanpur
Note: 26 pages
Abstract: In this paper we introduce a new discrete time and continuous state space stationary process $\{X_n; n = 1, 2, \ldots \}$, such that $X_n$ follows a two-parameter generalized exponential (GE) distribution. Joint distribution functions, characterization and some dependency properties of this new process have been investigated. The GE-process has three unknown parameters, two shape parameters and one scale parameter, and due to this reason it is more flexible than the existing exponential process. In presence of the scale parameter, if the two shape parameters are equal, then the maximum likelihood estimators of the unknown parameters can be obtained by solving one non-linear equation and if the two shape parameters are arbitrary, then the maximum likelihood estimators can be obtained by solving a two dimensional optimization problem. Two synthetic data sets, and one real gold-price data set have been analyzed to see the performance of the proposed model in practice. Finally some generalizations have been indicated.
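Sampling from the GE marginal is direct by inverse CDF, since F(x) = (1 - e^{-λx})^α inverts to x = -ln(1 - u^{1/α})/λ; this sketches the marginal distribution only, not the paper's stationary process construction.

```python
import numpy as np

def rgen_exp(n, alpha, lam, seed=6):
    """Inverse-CDF draws from the two-parameter generalized exponential
    distribution GE(alpha, lam), with F(x) = (1 - exp(-lam*x))**alpha."""
    u = np.random.default_rng(seed).uniform(size=n)
    return -np.log(1.0 - u ** (1.0 / alpha)) / lam

x = rgen_exp(100_000, alpha=2.0, lam=1.0)
# E[X] = (psi(alpha + 1) - psi(1)) / lam = 1 + 1/2 = 1.5 for alpha = 2
print(x.mean())
```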

【13】 On robust risk-based active-learning algorithms for enhanced decision support
Link: https://arxiv.org/abs/2201.02555

Authors: Aidan J. Hughes, Lawrence A. Bull, Paul Gardner, Nikolaos Dervilis, Keith Worden
Affiliations: Department of Mechanical Engineering, University of Sheffield, UK; The Alan Turing Institute
Note: 48 pages, 39 figures, submitted to Mechanical Systems and Signal Processing
Abstract: Classification models are a fundamental component of physical-asset management technologies such as structural health monitoring (SHM) systems and digital twins. Previous work introduced "risk-based active learning", an online approach for the development of statistical classifiers that takes into account the decision-support context in which they are applied. Decision-making is considered by preferentially querying data labels according to "expected value of perfect information" (EVPI). Although several benefits are gained by adopting a risk-based active learning approach, including improved decision-making performance, the algorithms suffer from issues relating to sampling bias as a result of the guided querying process. This sampling bias ultimately manifests as a decline in decision-making performance during the later stages of active learning, which in turn corresponds to lost resource/utility. The current paper proposes two novel approaches to counteract the effects of sampling bias: semi-supervised learning, and discriminative classification models. These approaches are first visualised using a synthetic dataset, then subsequently applied to an experimental case study, specifically, the Z24 Bridge dataset. The semi-supervised learning approach is shown to have variable performance; with robustness to sampling bias dependent on the suitability of the generative distributions selected for the model with respect to each dataset. In contrast, the discriminative classifiers are shown to have excellent robustness to the effects of sampling bias. Moreover, it was found that the number of inspections made during a monitoring campaign, and therefore resource expenditure, could be reduced with the careful selection of the statistical classifiers used within a decision-supporting monitoring system.
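The EVPI quantity that drives the querying can be computed for a discrete decision problem as E_state[max_action u] - max_action E_state[u]; a toy inspect-or-not example (the utilities and probabilities below are made up for illustration):

```python
import numpy as np

def evpi(utilities, probs):
    """Expected value of perfect information for a discrete decision:
    E_state[max_action u] - max_action E_state[u]. Rows of `utilities`
    are actions, columns are (unknown) states."""
    u = np.asarray(utilities, dtype=float)
    p = np.asarray(probs, dtype=float)
    return float(p @ u.max(axis=0) - (u @ p).max())

# toy structural-health decision: do nothing (costly if damaged) vs inspect
u = np.array([[ 0.0, -10.0],   # do nothing: free if healthy, costly if damaged
              [-1.0,  -1.0]])  # inspect: small fixed cost either way
print(evpi(u, [0.9, 0.1]))  # 0.9: resolving the state uncertainty is valuable
```

In risk-based active learning, labels are preferentially queried where this value is largest, which is precisely the guided process that induces the sampling bias the paper addresses.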

【14】 Dynamic Factor Model for Functional Time Series: Identification, Estimation, and Prediction
Link: https://arxiv.org/abs/2201.02532

Authors: Sven Otto, Nazarii Salish
Affiliations: Institute of Finance and Statistics, University of Bonn; Department of Economics, Universidad Carlos III de Madrid
Abstract: A functional dynamic factor model for time-dependent functional data is proposed. We decompose a functional time series into a predictive low-dimensional common component consisting of a finite number of factors and an infinite-dimensional idiosyncratic component that has no predictive power. The conditions under which all model parameters, including the number of factors, become identifiable are discussed. Our identification results lead to a simple-to-use two-stage estimation procedure based on functional principal components. As part of our estimation procedure, we solve the separation problem between the common and idiosyncratic functional components. In particular, we obtain a consistent information criterion that provides joint estimates of the number of factors and dynamic lags of the common component. Finally, we illustrate the applicability of our method in a simulation study and to the problem of modeling and predicting yield curves. In an out-of-sample experiment, we demonstrate that our model performs well compared to the widely used term structure Nelson-Siegel model for yield curves.
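The two-stage idea, functional principal components first and factor dynamics second, can be sketched on simulated curves driven by a single AR(1) factor; this is a deliberately minimal stand-in for the paper's estimator and ignores the factor/lag selection step.

```python
import numpy as np

rng = np.random.default_rng(9)
T, m = 300, 50                        # 300 curves observed on a 50-point grid
grid = np.linspace(0.0, 1.0, m)
f = np.zeros(T)
for t in range(1, T):                 # latent AR(1) factor (the dynamics)
    f[t] = 0.8 * f[t - 1] + rng.normal()
curves = np.outer(f, np.sin(np.pi * grid)) + 0.1 * rng.normal(size=(T, m))

# stage 1: functional PCA (via SVD) recovers the loading curve and the factor
U, s, Vt = np.linalg.svd(curves - curves.mean(axis=0), full_matrices=False)
factor_hat = U[:, 0] * s[0]
# stage 2: fit the factor dynamics; the AR(1) coefficient should be near 0.8
phi = float(factor_hat[1:] @ factor_hat[:-1]
            / (factor_hat[:-1] @ factor_hat[:-1]))
print(phi)
```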

【15】 Similarities and Differences between Machine Learning and Traditional Advanced Statistical Modeling in Healthcare Analytics
Link: https://arxiv.org/abs/2201.02469

Authors: Michele Bennett, Karin Hayes, Ewa J. Kleczyk, Rajesh Mehta
Affiliations: Kleczyk is also an Affiliated Graduate Faculty in the School of Economics at The University of Maine, and in Business Analytics at Grand Canyon University
Note: 16 pages, 2 figures
Abstract: Data scientists and statisticians are often at odds when determining the best approach, machine learning or statistical modeling, to solve an analytics challenge. However, machine learning and statistical modeling are more cousins than adversaries on different sides of an analysis battleground. Choosing between the two approaches or in some cases using both is based on the problem to be solved and outcomes required as well as the data available for use and circumstances of the analysis. Machine learning and statistical modeling are complementary, based on similar mathematical principles, but simply using different tools in an overall analytics knowledge base. Determining the predominant approach should be based on the problem to be solved as well as empirical evidence, such as size and completeness of the data, number of variables, assumptions or lack thereof, and expected outcomes such as predictions or causality. Good analysts and data scientists should be well versed in both techniques and their proper application, thereby using the right tool for the right project to achieve the desired results.

【16】 Applications of Signature Methods to Market Anomaly Detection
Link: https://arxiv.org/abs/2201.02441

Authors: Erdinc Akyildirim, Matteo Gambara, Josef Teichmann, Syang Zhou
Affiliations: Department of Mathematics, ETH Zurich, Switzerland; Department of Banking and Finance, University of Zurich, Zurich, Switzerland
Abstract: Anomaly detection is the process of identifying abnormal instances or events in data sets which deviate from the norm significantly. In this study, we propose a signature-based machine learning algorithm to detect rare or unexpected items in a given data set of time series type. We present applications of signature or randomized signature as feature extractors for anomaly detection algorithms; additionally we provide an easy, representation theoretic justification for the construction of randomized signatures. Our first application is based on synthetic data and aims at distinguishing between real and fake trajectories of stock prices, which are indistinguishable by visual inspection. We also show a real life application by using transaction data from the cryptocurrency market. In this case, we are able to identify pump and dump attempts organized on social networks with F1 scores up to 88% by means of our unsupervised learning algorithm, thus achieving results that are close to the state-of-the-art in the field based on supervised learning.
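The feature extractor at the core of the method is the truncated path signature; a level-2 implementation for piecewise-linear paths, accumulated via Chen's relation (the randomized-signature variant from the paper is not shown):

```python
import numpy as np

def signature_level2(path):
    """Level-2 truncated signature of a piecewise-linear path (rows are
    points): level 1 is the total increment, level 2 holds the iterated
    integrals S2[i, j] (integral of dX_i dX_j), built up exactly via
    Chen's relation segment by segment."""
    d = path.shape[1]
    S1, S2 = np.zeros(d), np.zeros((d, d))
    for dx in np.diff(path, axis=0):
        S2 += np.outer(S1, dx) + 0.5 * np.outer(dx, dx)  # Chen's identity
        S1 += dx
    return S1, S2

# L-shaped path: right then up; the area integral S2[0, 1] equals 1
S1, S2 = signature_level2(np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]))
print(S1, S2[0, 1])
```

The antisymmetric part of S2 (the Lévy area) is what distinguishes the order in which a path moves, which makes signatures sensitive to dynamics that summary statistics miss.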

【17】 Generalized quantum similarity learning
Link: https://arxiv.org/abs/2201.02310

Authors: Santosh Kumar Radha, Casey Jao
Affiliations: Agnostiq Inc., Front St W, Toronto, ON
Abstract: The similarity between objects is significant in a broad range of areas. While similarity can be measured using off-the-shelf distance functions, they may fail to capture the inherent meaning of similarity, which tends to depend on the underlying data and task. Moreover, conventional distance functions limit the space of similarity measures to be symmetric and do not directly allow comparing objects from different spaces. We propose using quantum networks (GQSim) for learning task-dependent (a)symmetric similarity between data that need not have the same dimensionality. We analyze the properties of such similarity functions analytically (for a simple case) and numerically (for a complex case) and show that these similarity measures can extract salient features of the data. We also demonstrate that the similarity measure derived using this technique is $(\epsilon,\gamma,\tau)$-good, resulting in theoretically guaranteed performance. Finally, we conclude by applying this technique to three relevant applications: classification, graph completion, and generative modeling.

【18】 Well-Conditioned Linear Minimum Mean Square Error Estimation
标题:良态线性最小均方误差估计
链接:https://arxiv.org/abs/2201.02275

作者:Edwin K. P. Chong
摘要:计算线性最小均方误差(LMMSE)滤波器通常是病态的,这表明无约束的均方误差最小化是滤波器设计的一个不足原则。为了解决这个问题,我们首先开发了一个统一的框架来研究约束LMMSE估计问题。利用这个框架,我们揭示了所有约束LMMSE滤波器的一个重要结构特性,并表明它们都包含一个固有的预处理步骤。这将仅通过预条件器参数化所有此类过滤器。此外,每个滤波器对其预条件子的可逆线性变换是不变的。然后我们阐明,仅仅限制滤波器的秩,导致众所周知的低秩维纳滤波器,并不适合解决病态问题。相反,我们使用一个约束,它明确要求解决方案在特定意义上具有良好的条件。我们引入了两个条件良好的估计量,并评估了它们的均方误差性能。我们证明了这两个估计器在截断功率比收敛到零时收敛到标准LMMSE滤波器,但在标度律方面比低秩维纳滤波器慢。这暴露了条件良好的代价。我们还展示了历史VIX数据的定量结果,以说明我们的两个条件良好的估计量的性能。
摘要:Computing linear minimum mean square error (LMMSE) filters is often ill conditioned, suggesting that unconstrained minimization of the mean square error is an inadequate principle for filter design. To address this, we first develop a unifying framework for studying constrained LMMSE estimation problems. Using this framework, we expose an important structural property of all constrained LMMSE filters and show that they all involve an inherent preconditioning step. This parameterizes all such filters only by their preconditioners. Moreover, each filter is invariant to invertible linear transformations of its preconditioner. We then clarify that merely constraining the rank of the filter, leading to the well-known low-rank Wiener filter, does not suitably address the problem of ill conditioning. Instead, we use a constraint that explicitly requires solutions to be well conditioned in a certain specific sense. We introduce two well-conditioned estimators and evaluate their mean-squared-error performance. We show that these two estimators converge to the standard LMMSE filter as their truncated-power ratio converges to zero, but more slowly than the low-rank Wiener filter in terms of scaling law. This exposes the price for being well conditioned. We also show quantitative results with historical VIX data to illustrate the performance of our two well-conditioned estimators.
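LMMSE 权重 $W = R_{xy}R_{yy}^{-1}$ 的病态性,以及通过丢弃小特征值获得良态解的思路,可以用一个 2×2 的玩具例子体会(这只是示意性的截断谱方法,并非论文中两种估计量的实现):

```python
import math

# Toy 2-D example: estimate x from y with weights W = R_xy R_yy^{-1}.
# When the two observations are almost perfectly correlated, R_yy is
# nearly singular and the exact inverse blows up; dropping the tiny
# eigenvalue gives a well-conditioned surrogate filter.
rho = 0.9999
# Eigen-pairs of R_yy = [[1, rho], [rho, 1]], known in closed form.
R_yy_eigvals = [1 + rho, 1 - rho]
R_yy_eigvecs = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
                [1 / math.sqrt(2), -1 / math.sqrt(2)]]
r_xy = [1.0, 0.5]  # cross-correlation vector (illustrative values)

def filter_weights(tol):
    """LMMSE weights via eigen-decomposition, dropping eigenvalues < tol."""
    w = [0.0, 0.0]
    for lam, v in zip(R_yy_eigvals, R_yy_eigvecs):
        if lam >= tol:
            coef = sum(r * vi for r, vi in zip(r_xy, v)) / lam
            for i in range(2):
                w[i] += coef * v[i]
    return w

w_exact = filter_weights(0.0)    # ill conditioned: entries of order ~2500
w_trunc = filter_weights(1e-2)   # well conditioned: modest entries
print(w_exact, w_trunc)
```

w_exact 的元素量级达到数千,而 w_trunc 保持在 1 以下:截断牺牲了一部分均方误差以换取数值稳定,这正对应摘要所说的“良态的代价”。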

【19】 The effect of co-location of human communication networks
标题:人类通信网络的共址效应
链接:https://arxiv.org/abs/2201.02230

作者:Daniel Carmody,Martina Mazzarello,Paolo Santi,Trevor Harris,Sune Lehmann,Timur Abbiasov,Robin Dunbar,Carlo Ratti
机构:Texas A&M University
备注:19 pages, 2500 words, 5 figures. Supplementary information included as appendix
摘要:重新连接通信网络中联系的能力对于大规模的人类合作和新思想的传播至关重要。对知识传播尤其重要的是形成新的弱联系的能力——这种联系在社会系统的遥远部分之间起着桥梁作用,并促成新信息的流动。在这里,我们展示了COVID-19封锁期间研究人员缺乏共处,导致北美一所大型大学(MIT校园)的电子邮件网络在18个月内损失了超过4800条弱联系。此外,我们发现,自2021年9月起通过混合工作模式重新引入部分共处,带来了弱联系的部分再生,特别是在工作地点相近的研究人员之间。我们通过一个基于物理邻近性的新模型量化了共处对联系更新的影响——我们称之为nexogenesis过程——该模型能够重现所有的经验观察结果。结果表明,不在同一地点工作的员工不太可能形成联系,从而削弱了工作场所中信息的传播。这些发现有助于更好地理解人类通信网络的时空动态,并帮助正在实施混合工作政策的组织评估健康工作生活所需的最少面对面互动。
摘要:The ability to rewire ties in communication networks is vital for large-scale human cooperation and the spread of new ideas. Especially important for knowledge dissemination is the ability to form new weak ties -- ties which act as bridges between distant parts of the social system and enable the flow of novel information. Here we show that lack of researcher co-location during the COVID-19 lockdown caused the loss of more than 4800 weak ties over 18 months in the email network of a large North American university -- the MIT campus. Furthermore, we find that the re-introduction of partial co-location through a hybrid work mode starting in September 2021 led to a partial regeneration of weak ties, especially between researchers who work in close proximity. We quantify the effect of co-location in renewing ties -- a process that we have termed nexogenesis -- through a novel model based on physical proximity, which is able to reproduce all empirical observations. Results highlight that employees who are not co-located are less likely to form ties, weakening the spread of information in the workplace. Such findings could contribute to a better understanding of the spatio-temporal dynamics of human communication networks -- and help organizations that are moving towards the implementation of hybrid work policies to evaluate the minimum amount of in-person interaction necessary for a healthy work life.
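弱联系作为“局部桥”的直觉可以用邻域重叠度(neighborhood overlap)量化:两端点没有共同邻居的边连接着网络中彼此疏远的部分。下面是一个与论文模型无关的极简示意,图数据为虚构:

```python
def neighborhood_overlap(adj, u, v):
    """Jaccard overlap of the neighborhoods of u and v (excluding u, v).
    Edges with overlap 0 are local bridges -- the 'weak ties' that
    connect otherwise distant parts of the network."""
    nu = adj[u] - {v}
    nv = adj[v] - {u}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

def weak_ties(adj):
    """Return the edges whose endpoints share no common neighbor."""
    edges = {tuple(sorted((u, v))) for u in adj for v in adj[u]}
    return [e for e in edges if neighborhood_overlap(adj, *e) == 0.0]

# Two tight clusters joined by a single bridge edge (C, D).
adj = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
print(weak_ties(adj))  # → [('C', 'D')]
```

论文中弱联系的流失与再生即是对这类桥接边在时间上的计数;此处的重叠度定义只是社会网络分析中常用的一种代理指标。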

【20】 Nonlocal Kernel Network (NKN): a Stable and Resolution-Independent Deep Neural Network
标题:非局部核网络(NKN):一种稳定的与分辨率无关的深度神经网络
链接:https://arxiv.org/abs/2201.02217

作者:Huaiqian You,Yue Yu,Marta D'Elia,Tian Gao,Stewart Silling
机构:Department of Mathematics, Lehigh University, Bethlehem, PA, Computational Science and Analysis, Sandia National Laboratories, Livermore, CA, IBM Research, Yorktown Heights, NY, Center for Computing Research, Sandia National Laboratories, Albuquerque, NM
摘要:最近,神经算子已成为以神经网络形式设计函数空间之间解映射的流行工具。与经典的科学机器学习方法(在固定分辨率下、针对单个输入参数实例学习已知偏微分方程(PDE)的参数)不同,神经算子近似的是一族偏微分方程的解映射。尽管取得了成功,但迄今为止,神经算子的使用仅限于相对较浅的神经网络,并且仅限于学习隐藏的控制规律。在这项工作中,我们提出了一种新的非局部神经算子,我们称之为非局部核网络(nonlocal kernel network,NKN),它与分辨率无关,以深度神经网络为特征,能够处理各种任务,如学习控制方程和图像分类。我们的NKN源于将神经网络解释为一个离散的非局部扩散反应方程:在无限层的极限下,该方程等价于一个抛物型非局部方程,其稳定性通过非局部向量演算进行分析。与神经算子积分形式的相似性使NKN能够捕获特征空间中的长程依赖,而对节点间相互作用的连续处理使NKN与分辨率无关。与神经常微分方程(在非局部意义上重新解释)的相似性,以及层间稳定的网络动力学,允许将NKN的最优参数从浅网络推广到深网络,这使得从浅到深的初始化技术成为可能。我们的测试表明,NKN在学习控制方程和图像分类任务上都优于基线方法,并且可以很好地推广到不同的分辨率和深度。
摘要:Neural operators have recently become popular tools for designing solution maps between function spaces in the form of neural networks. Differently from classical scientific machine learning approaches that learn parameters of a known partial differential equation (PDE) for a single instance of the input parameters at a fixed resolution, neural operators approximate the solution map of a family of PDEs. Despite their success, the uses of neural operators are so far restricted to relatively shallow neural networks and confined to learning hidden governing laws. In this work, we propose a novel nonlocal neural operator, which we refer to as nonlocal kernel network (NKN), that is resolution independent, characterized by deep neural networks, and capable of handling a variety of tasks such as learning governing equations and classifying images. Our NKN stems from the interpretation of the neural network as a discrete nonlocal diffusion reaction equation that, in the limit of infinite layers, is equivalent to a parabolic nonlocal equation, whose stability is analyzed via nonlocal vector calculus. The resemblance with integral forms of neural operators allows NKNs to capture long-range dependencies in the feature space, while the continuous treatment of node-to-node interactions makes NKNs resolution independent. The resemblance with neural ODEs, reinterpreted in a nonlocal sense, and the stable network dynamics between layers allow for generalization of NKN's optimal parameters from shallow to deep networks. This fact enables the use of shallow-to-deep initialization techniques. Our tests show that NKNs outperform baseline methods in both learning governing equations and image classification tasks and generalize well to different resolutions and depths.
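NKN 将网络层解释为离散的非局部扩散–反应方程,其分辨率无关性来自核定义在连续坐标上。下面的纯 Python 草图只演示一步显式非局部扩散(高斯核为假设,省略了论文中的可学习核与反应项):

```python
import math

def nonlocal_diffusion_step(u, xs, dt=0.1, eps=0.3):
    """One explicit nonlocal diffusion update
        u_i <- u_i + dt * dx * sum_j w(x_i, x_j) * (u_j - u_i),
    with a Gaussian kernel w of width eps. Because w is defined on the
    continuous coordinates xs, the same kernel can be evaluated on any
    grid -- the source of resolution independence in NKN-style layers.
    """
    n = len(u)
    dx = xs[1] - xs[0]
    out = []
    for i in range(n):
        flux = sum(math.exp(-((xs[i] - xs[j]) / eps) ** 2) * (u[j] - u[i])
                   for j in range(n))
        out.append(u[i] + dt * dx * flux)
    return out

# A square pulse on a uniform grid of 20 points, diffused for 5 steps.
xs = [k / 19 for k in range(20)]
u0 = [1.0 if 0.4 <= x <= 0.6 else 0.0 for x in xs]
u = u0
for _ in range(5):
    u = nonlocal_diffusion_step(u, xs)
print(f"mass before/after: {sum(u0):.6f} {sum(u):.6f}")  # mass is conserved
```

由于核对称,扩散项逐对反对称,总质量精确守恒,而峰值被逐步抹平;同一核函数可在任意网格上求值,这正是“分辨率无关”的含义。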

机器翻译,仅供参考

点击“阅读原文”获取带摘要的学术速递
