1.6 Methods for preprocessing, harmonization, and quality control

Tools and algorithms for preprocessing, harmonizing, denoising, segmenting, and quality controlling neuroimaging and omics data, including cross-site harmonization and batch-effect correction.

Convolutional neural networks for classification of Alzheimer’s disease: Overview and reproducible evaluation.

Numerous machine learning (ML) approaches have been proposed for automatic classification of Alzheimer’s disease (AD) from brain imaging data. In particular, over 30 papers have proposed to use convolutional neural networks (CNN) for AD classification from anatomical MRI. However, the classification performance is difficult to compare across studies due to variations in components such as participant selection, image preprocessing or validation procedure. Moreover, these studies are hardly reproducible because their frameworks are not publicly accessible and because implementation details are lacking. Lastly, some of these papers may report a biased performance due to inadequate or unclear validation or model selection procedures. In the present work, we aim to address these limitations through three main contributions. First, we performed a systematic literature review. We identified four main types of approaches: i) 2D slice-level, ii) 3D patch-level, iii) ROI-based and iv) 3D subject-level CNN. Moreover, we found that more than half of the surveyed papers may have suffered from data leakage and thus reported biased performance. Our second contribution is the extension of our open-source framework for classification of AD using CNN and T1-weighted MRI. The framework comprises previously developed tools to automatically convert ADNI, AIBL and OASIS data into the BIDS standard, and a modular set of image preprocessing procedures, classification architectures and evaluation procedures dedicated to deep learning. Finally, we used this framework to rigorously compare different CNN architectures. The data were split into training/validation/test sets at the very beginning and only the training/validation sets were used for model selection. To avoid any overfitting, the test sets were left untouched until the end of the peer-review process. Overall, the different 3D approaches (3D-subject, 3D-ROI, 3D-patch) achieved similar performances while that of the 2D slice approach was lower. Of note, the different CNN approaches did not perform better than an SVM with voxel-based features. The different approaches generalized well to similar populations but not to datasets with different inclusion criteria or demographic characteristics. All the code of the framework and the experiments is publicly available: general-purpose tools have been integrated into the Clinica software (www.clinica.run) and the paper-specific code is available at: https://github.com/aramis-lab/AD-DL.

URL: https://github.com/aramis-lab/AD-DL
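
The subject-level splitting this entry emphasizes (to avoid data leakage) can be illustrated with a small sketch; this is a generic scikit-learn illustration with made-up IDs and labels, not the AD-DL code:

    # Hedged sketch: split by participant so that all images from one subject land in a
    # single partition, the precaution against data leakage described in the entry above.
    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    rng = np.random.default_rng(0)
    n_images = 200
    subject_ids = rng.integers(0, 60, size=n_images)   # hypothetical participant IDs
    labels = rng.integers(0, 2, size=n_images)          # hypothetical AD/CN labels
    images = np.arange(n_images)                        # stand-ins for image indices

    # Hold out a test set by subject first ...
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_val_idx, test_idx = next(outer.split(images, labels, groups=subject_ids))

    # ... then split the remainder into training/validation, again by subject,
    # so model selection never touches test subjects.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    train_idx, val_idx = next(inner.split(images[train_val_idx],
                                          labels[train_val_idx],
                                          groups=subject_ids[train_val_idx]))

    assert not set(subject_ids[test_idx]) & set(subject_ids[train_val_idx])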

Beyond the eye: A relational model for early dementia detection using retinal OCTA images.

Early detection of dementia, such as Alzheimer’s disease (AD) or mild cognitive impairment (MCI), is essential to enable timely intervention and potential treatment. Accurate detection of AD/MCI is challenging due to the high complexity, cost, and often invasive nature of current diagnostic techniques, which limit their suitability for large-scale population screening. Given the shared embryological origins and physiological characteristics of the retina and brain, retinal imaging is emerging as a potentially rapid and cost-effective alternative for the identification of individuals with or at high risk of AD. In this paper, we present a novel PolarNet+ that uses retinal optical coherence tomography angiography (OCTA) to discriminate early-onset AD (EOAD) and MCI subjects from controls. Our method first maps OCTA images from Cartesian coordinates to polar coordinates, allowing approximate sub-region calculation to implement the clinician-friendly early treatment of diabetic retinopathy study (ETDRS) grid analysis. We then introduce a multi-view module to serialize and analyze the images along three dimensions for comprehensive, clinically useful information extraction. Finally, we abstract the sequence embedding into a graph, transforming the detection task into a general graph classification problem. A regional relationship module is applied after the multi-view module to explore the relationship between the sub-regions. Such regional relationship analyses validate known eye-brain links and reveal new discriminative patterns. The proposed model is trained, tested, and validated on four retinal OCTA datasets, including 1,671 participants with AD, MCI, and healthy controls. Experimental results demonstrate the performance of our model in detecting AD and MCI with an AUC of 88.69% and 88.02%, respectively. Our results provide evidence that retinal OCTA imaging, coupled with artificial intelligence, may serve as a rapid and non-invasive approach for large-scale screening of AD and MCI. The code is available at https://github.com/iMED-Lab/PolarNet-Plus-PyTorch, and the dataset is also available upon request.

URL: https://github.com/iMED-Lab/PolarNet-Plus-PyTorch
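
The Cartesian-to-polar remapping that underlies the ETDRS-style analysis above can be sketched in a few lines; this is a generic illustration assuming a fovea-centred en-face image, not the released PolarNet+ code:

    # Hedged sketch: resample an en-face OCTA image onto a (radius, angle) grid so that
    # ETDRS-like rings and sectors become axis-aligned bands that are easy to pool over.
    import numpy as np
    from scipy.ndimage import map_coordinates

    def cartesian_to_polar(img, n_radii=128, n_angles=360):
        h, w = img.shape
        cy, cx = (h - 1) / 2.0, (w - 1) / 2.0            # assume the fovea sits at the image centre
        radii = np.linspace(0, min(cy, cx), n_radii)
        angles = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
        rr, aa = np.meshgrid(radii, angles, indexing="ij")
        ys = cy + rr * np.sin(aa)
        xs = cx + rr * np.cos(aa)
        # Rows index radius, columns index angle; each ring is a horizontal band.
        return map_coordinates(img, [ys, xs], order=1, mode="nearest")

    polar = cartesian_to_polar(np.random.rand(304, 304))   # 304x304 is a common OCTA grid
    print(polar.shape)                                      # (128, 360)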

Learning to synthesise the ageing brain without longitudinal data.

How will my face look when I get older? Or, for a more challenging question: How will my brain look when I get older? To answer this question one must devise (and learn from data) a multivariate auto-regressive function which, given an image and a desired target age, generates an output image. While collecting data for faces may be easier, collecting longitudinal brain data is not trivial. We propose a deep learning-based method that learns to simulate subject-specific brain ageing trajectories without relying on longitudinal data. Our method synthesises images conditioned on two factors: age (a continuous variable), and status of Alzheimer’s Disease (AD, an ordinal variable). With an adversarial formulation we learn the joint distribution of brain appearance, age and AD status, and define reconstruction losses to address the challenging problem of preserving subject identity. We compare with several benchmarks using two widely used datasets. We evaluate the quality and realism of synthesised images using ground-truth longitudinal data and a pre-trained age predictor. We show that, despite the use of cross-sectional data, our model learns patterns of gray matter atrophy in the middle temporal gyrus in patients with AD. To demonstrate generalisation ability, we train on one dataset and evaluate predictions on the other. In conclusion, our model shows an ability to separate age, disease influence and anatomy using only 2D cross-sectional data, which should be useful in large studies of neurodegenerative disease that aim to combine several data sources. To facilitate such future studies by the community at large, our code is made available at https://github.com/xiat0616/BrainAgeing.

URL: https://github.com/xiat0616/BrainAgeing

Inferring brain causal and temporal-lag networks for recognizing abnormal patterns of dementia.

Brain functional network analysis has become a popular method to explore the laws of brain organization and identify biomarkers of neurological diseases. However, it is still a challenging task to construct an ideal brain network due to the limited understanding of the human brain. Existing methods often ignore the impact of temporal-lag on the results of brain network modeling, which may lead to some unreliable conclusions. To overcome this issue, we propose a novel brain functional network estimation method, which can simultaneously infer the causal mechanisms and temporal-lag values among brain regions. Specifically, our method converts the lag learning into an instantaneous effect estimation problem, and further embeds the search objectives into a deep neural network model as parameters to be learned. To verify the effectiveness of the proposed estimation method, we perform experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database by comparing the proposed model with several existing methods, including correlation-based and causality-based methods. The experimental results show that our brain networks constructed by the proposed estimation method can not only achieve promising classification performance, but also exhibit some characteristics of physiological mechanisms. Our approach provides a new perspective for understanding the pathogenesis of brain diseases. The source code is released at https://github.com/NJUSTxiazw/CTLN.

URL: https://github.com/NJUSTxiazw/CTLN

Precise and Rapid Whole-Head Segmentation from Magnetic Resonance Images of Older Adults using Deep Learning.

Whole-head segmentation from Magnetic Resonance Images (MRI) establishes the foundation for individualized computational models using the finite element method (FEM). This foundation paves the path for computer-aided solutions in fields such as non-invasive brain stimulation. Most current automatic head segmentation tools are developed using healthy young adults. Thus, they may neglect the older population that is more prone to age-related structural decline such as brain atrophy. In this work, we present a new deep learning method called GRACE, which stands for General, Rapid, And Comprehensive whole-hEad tissue segmentation. GRACE is trained and validated on a novel dataset that consists of 177 manually corrected MR-derived reference segmentations that have undergone meticulous manual review. Each T1-weighted MRI volume is segmented into 11 tissue types, including white matter, grey matter, eyes, cerebrospinal fluid, air, blood vessel, cancellous bone, cortical bone, skin, fat, and muscle. To the best of our knowledge, this work contains the largest manually corrected dataset to date in terms of number of MRIs and segmented tissues. GRACE outperforms five freely available software tools and a traditional 3D U-Net on a five-tissue segmentation task. On this task, GRACE achieves an average Hausdorff Distance of 0.21, outperforming the runner-up’s average Hausdorff Distance of 0.36. GRACE can segment a whole-head MRI in about 3 seconds, while the fastest software tool takes about 3 minutes. In summary, GRACE segments a spectrum of tissue types from older adults’ T1-MRI scans at favorable accuracy and speed. The trained GRACE model is optimized on older adult heads to enable high-precision modeling in age-related brain disorders. To support open science, the GRACE code and trained weights are made available online and open to the research community at https://github.com/lab-smile/GRACE.

URL: https://github.com/lab-smile/GRACE

“Recon-all-clinical”: Cortical surface reconstruction and analysis of heterogeneous clinical brain MRI.

Surface-based analysis of the cerebral cortex is ubiquitous in human neuroimaging with MRI. It is crucial for tasks like cortical registration, parcellation, and thickness estimation. Traditionally, such analyses require high-resolution, isotropic scans with good gray-white matter contrast, typically a T1-weighted scan with 1 mm resolution. This requirement precludes application of these techniques to most MRI scans acquired for clinical purposes, since they are often anisotropic and lack the required T1-weighted contrast. To overcome this limitation and enable large-scale neuroimaging studies using vast amounts of existing clinical data, we introduce recon-all-clinical, a novel methodology for cortical reconstruction, registration, parcellation, and thickness estimation for clinical brain MRI scans of any resolution and contrast. Our approach employs a hybrid analysis method that combines a convolutional neural network (CNN) trained with domain randomization to predict signed distance functions (SDFs), and classical geometry processing for accurate surface placement while maintaining topological and geometric constraints. The method does not require retraining for different acquisitions, thus simplifying the analysis of heterogeneous clinical datasets. We evaluated recon-all-clinical on multiple public datasets, including ADNI, HCP, AIBL and OASIS, as well as a large clinical dataset of over 9,500 scans. The results indicate that our method produces geometrically precise cortical reconstructions across different MRI contrasts and resolutions, consistently achieving high accuracy in parcellation. Cortical thickness estimates are precise enough to capture aging effects, independently of MRI contrast, even though accuracy varies with slice thickness. Our method is publicly available at https://surfer.nmr.mgh.harvard.edu/fswiki/recon-all-clinical, enabling researchers to perform detailed cortical analysis on the huge amounts of already existing clinical MRI scans. This advancement may be particularly valuable for studying rare diseases and underrepresented populations where research-grade MRI data is scarce.

URL: https://surfer.nmr.mgh.harvard.edu/fswiki/recon-all-clinical

White matter hyperintensities segmentation using the ensemble U-Net with multi-scale highlighting foregrounds.

White matter hyperintensities (WMHs) are abnormal signals within the white matter region on the human brain MRI and have been associated with aging processes, cognitive decline, and dementia. In the current study, we proposed a U-Net with multi-scale highlighting foregrounds (HF) for WMHs segmentation. Our method, U-Net with HF, is designed to improve the detection of the WMH voxels with partial volume effects. We evaluated the segmentation performance of the proposed approach using the Challenge training dataset. Then we assessed the clinical utility of the WMH volumes that were automatically computed using our method and the Alzheimer’s Disease Neuroimaging Initiative database. We demonstrated that the U-Net with HF significantly improved the detection of the WMH voxels at the boundary of the WMHs or in small WMH clusters quantitatively and qualitatively. To date, the proposed method has achieved the best overall evaluation scores, the highest dice similarity index, and the best F1-score among 39 methods submitted on the WMH Segmentation Challenge that was initially hosted by MICCAI 2017 and is continuously accepting new challengers. The evaluation of the clinical utility showed that the WMH volume that was automatically computed using U-Net with HF was significantly associated with cognitive performance and improved the classification between cognitively normal and Alzheimer’s disease subjects and between patients with mild cognitive impairment and those with Alzheimer’s disease. The implementation of our proposed method is publicly available on Docker Hub (https://hub.docker.com/r/wmhchallenge/pgs).

URL: https://hub.docker.com/r/wmhchallenge/pgs

Metadata-conditioned generative models to synthesize anatomically-plausible 3D brain MRIs.

Recent advances in generative models have paved the way for enhanced generation of natural and medical images, including synthetic brain MRIs. However, the mainstay of current AI research focuses on optimizing synthetic MRIs with respect to visual quality (such as signal-to-noise ratio) while lacking insights into their relevance to neuroscience. To generate high-quality T1-weighted MRIs relevant for neuroscience discovery, we present a two-stage Diffusion Probabilistic Model (called BrainSynth) to synthesize high-resolution MRIs conditionally-dependent on metadata (such as age and sex). We then propose a novel procedure to assess the quality of BrainSynth according to how well its synthetic MRIs capture macrostructural properties of brain regions and how accurately they encode the effects of age and sex. Results indicate that more than half of the brain regions in our synthetic MRIs are anatomically plausible, i.e., the effect size between real and synthetic MRIs is small relative to biological factors such as age and sex. Moreover, the anatomical plausibility varies across cortical regions according to their geometric complexity. As is, the MRIs generated by BrainSynth significantly improve the training of a predictive model to identify accelerated aging effects in an independent study. These results indicate that our model accurately captures the brain’s anatomical information and thus could enrich the data of underrepresented samples in a study. The code of BrainSynth will be released as part of the MONAI project at https://github.com/Project-MONAI/GenerativeModels.

URL: https://github.com/Project-MONAI/GenerativeModels
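
The plausibility criterion used above (real-versus-synthetic differences should be small relative to biological effects such as sex) can be made concrete with a toy effect-size comparison; the region and numbers below are placeholders, not results from the paper:

    # Hedged sketch: Cohen's d between real and synthetic regional measures,
    # compared against the Cohen's d of a biological factor for the same region.
    import numpy as np

    def cohens_d(a, b):
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
        return (a.mean() - b.mean()) / pooled_sd

    rng = np.random.default_rng(0)
    real = rng.normal(10.0, 1.0, size=200)        # hypothetical regional volumes from real MRIs
    synthetic = rng.normal(10.1, 1.0, size=200)   # hypothetical volumes from synthetic MRIs
    male = rng.normal(10.5, 1.0, size=150)        # hypothetical sex effect for the same region
    female = rng.normal(9.5, 1.0, size=150)

    d_synth = abs(cohens_d(real, synthetic))
    d_sex = abs(cohens_d(male, female))
    print(d_synth < d_sex)   # "plausible" here means the real-vs-synthetic gap is the smaller one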

Automated deep learning segmentation of high-resolution 7 Tesla postmortem MRI for quantitative analysis of structure-pathology correlations in neurodegenerative diseases.

Postmortem MRI allows brain anatomy to be examined at high resolution and pathology measures to be linked with morphometric measurements. However, automated segmentation methods for brain mapping in postmortem MRI are not well developed, primarily due to limited availability of labeled datasets and heterogeneity in scanner hardware and acquisition protocols. In this work, we present a high-resolution dataset of 135 postmortem human brain tissue specimens imaged at 0.3 mm³ isotropic resolution using a T2-weighted sequence on a 7T whole-body MRI scanner. We developed a deep learning pipeline to segment the cortical mantle by benchmarking the performance of nine deep neural architectures, followed by post-hoc topological correction. We evaluate the reliability of this pipeline via overlap metrics with manual segmentation in 6 specimens, and intra-class correlation between cortical thickness measures extracted from the automatic segmentation and expert-generated reference measures in 36 specimens. We also segment four subcortical structures (caudate, putamen, globus pallidus, and thalamus), white matter hyperintensities, and the normal-appearing white matter, providing a limited evaluation of accuracy. We show generalization across whole-brain hemispheres in different specimens, and also on unseen images acquired with 0.28 mm³ and 0.16 mm³ isotropic T2*-weighted fast low angle shot (FLASH) sequences at 7T. We report associations between localized cortical thickness and volumetric measurements across key regions, and semi-quantitative neuropathological ratings in a subset of 82 individuals with Alzheimer’s disease (AD) continuum diagnoses. Our code, Jupyter notebooks, and the containerized executables are publicly available at the project webpage (https://pulkit-khandelwal.github.io/exvivo-brain-upenn/).

URL: https://pulkit-khandelwal.github.io/exvivo-brain-upenn/

Hyperfusion: A hypernetwork approach to multimodal integration of tabular and medical imaging data for predictive modeling.

The integration of diverse clinical modalities such as medical imaging and the tabular data extracted from patients’ Electronic Health Records (EHRs) is a crucial aspect of modern healthcare. Integrative analysis of multiple sources can provide a comprehensive understanding of the clinical condition of a patient, improving diagnosis and treatment decisions. Deep Neural Networks (DNNs) consistently demonstrate outstanding performance in a wide range of multimodal tasks in the medical domain. However, the complex endeavor of effectively merging medical imaging with clinical, demographic and genetic information represented as numerical tabular data remains a highly active and ongoing research pursuit. We present a novel framework based on hypernetworks to fuse clinical imaging and tabular data by conditioning the image processing on the EHR’s values and measurements. This approach aims to leverage the complementary information present in these modalities to enhance the accuracy of various medical applications. We demonstrate the strength and generality of our method on two different brain Magnetic Resonance Imaging (MRI) analysis tasks, namely, brain age prediction conditioned by the subject’s sex and multi-class Alzheimer’s Disease (AD) classification conditioned by tabular data. We show that our framework outperforms both single-modality models and state-of-the-art MRI-tabular data fusion methods. A link to our code can be found at https://github.com/daniel4725/HyperFusion.

URL: https://github.com/daniel4725/HyperFusion
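
The central mechanism, a hypernetwork that turns tabular values into the weights used to process imaging features, can be sketched schematically; this is an assumption about the general idea with made-up dimensions, not the HyperFusion architecture:

    # Hedged PyTorch sketch: an MLP maps a tabular vector to the weights and bias of
    # a per-sample linear layer applied to image-derived features.
    import torch
    import torch.nn as nn

    class HyperLinear(nn.Module):
        def __init__(self, tab_dim, in_feats, out_feats):
            super().__init__()
            self.in_feats, self.out_feats = in_feats, out_feats
            self.hyper = nn.Sequential(                        # hypernetwork
                nn.Linear(tab_dim, 64), nn.ReLU(),
                nn.Linear(64, in_feats * out_feats + out_feats),
            )

        def forward(self, img_feats, tab):
            params = self.hyper(tab)                           # (B, in*out + out)
            w = params[:, : self.in_feats * self.out_feats]
            b = params[:, self.in_feats * self.out_feats :]
            w = w.view(-1, self.out_feats, self.in_feats)      # per-sample weight matrices
            return torch.bmm(w, img_feats.unsqueeze(-1)).squeeze(-1) + b

    layer = HyperLinear(tab_dim=8, in_feats=32, out_feats=3)   # e.g. 3 classes: CN/MCI/AD
    out = layer(torch.randn(4, 32), torch.randn(4, 8))
    print(out.shape)                                           # torch.Size([4, 3])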

Multi-Modal Diagnosis of Alzheimer’s Disease using Interpretable Graph Convolutional Networks.

The interconnection between brain regions in neurological disease encodes vital information for the advancement of biomarkers and diagnostics. Although graph convolutional networks are widely applied for discovering brain connection patterns that point to disease conditions, the potential of connection patterns that arise from multiple imaging modalities has yet to be fully realized. In this paper, we propose a multi-modal sparse interpretable GCN framework (SGCN) for the detection of Alzheimer’s disease (AD) and its prodromal stage, known as mild cognitive impairment (MCI). In our experimentation, SGCN learned the sparse regional importance probability to find signature regions of interest (ROIs), and the connective importance probability to reveal disease-specific brain network connections. We evaluated SGCN on the Alzheimer’s Disease Neuroimaging Initiative database with multi-modal brain images and demonstrated that the ROI features learned by SGCN were effective for enhancing AD status identification. The identified abnormalities were significantly correlated with AD-related clinical symptoms. We further interpreted the identified brain dysfunctions at the level of large-scale neural systems and sex-related connectivity abnormalities in AD/MCI. The salient ROIs and the prominent brain connectivity abnormalities interpreted by SGCN are considerably important for developing novel biomarkers. These findings contribute to a better understanding of the network-based disorder via multi-modal diagnosis and offer the potential for precision diagnostics. The source code is available at https://github.com/Houliang-Zhou/SGCN.

URL: https://github.com/Houliang-Zhou/SGCN

ICAM-Reg: Interpretable Classification and Regression With Feature Attribution for Mapping Neurological Phenotypes in Individual Scans.

An important goal of medical imaging is to be able to precisely detect patterns of disease specific to individual scans; however, this is challenged in brain imaging by the degree of heterogeneity of shape and appearance. Traditional methods, based on image registration, historically fail to detect variable features of disease, as they utilise population-based analyses, suited primarily to studying group-average effects. In this paper we therefore take advantage of recent developments in generative deep learning to develop a method for simultaneous classification, or regression, and feature attribution (FA). Specifically, we explore the use of a VAE-GAN (variational autoencoder - generative adversarial network) for translation called ICAM, to explicitly disentangle class-relevant features from background confounds, for improved interpretability and regression of neurological phenotypes. We validate our method on the tasks of Mini-Mental State Examination (MMSE) cognitive test score prediction for the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort, as well as brain age prediction, for both neurodevelopment and neurodegeneration, using the developing Human Connectome Project (dHCP) and UK Biobank datasets. We show that the generated FA maps can be used to explain outlier predictions and demonstrate that the inclusion of a regression module improves the disentanglement of the latent space. Our code is freely available on GitHub: https://github.com/CherBass/ICAM.

URL: https://github.com/CherBass/ICAM

A benchmark for hypothalamus segmentation on T1-weighted MR images.

The hypothalamus is a small brain structure that plays essential roles in sleep regulation, body temperature control, and metabolic homeostasis. Hypothalamic structural abnormalities have been reported in neuropsychiatric disorders, such as schizophrenia, amyotrophic lateral sclerosis, and Alzheimer’s disease. Although magnetic resonance (MR) imaging is the standard examination method for evaluating this region, hypothalamic morphological landmarks are unclear, leading to subjectivity and high variability during manual segmentation. Due to these limitations, it is common to find contradicting results in the literature regarding hypothalamic volumetry. To the best of our knowledge, only two automated methods are available in the literature for hypothalamus segmentation, the first of which is our previous method based on U-Net. However, both methods present performance losses when predicting images from datasets different from those used in training. Therefore, this project presents a benchmark consisting of a diverse T1-weighted MR image dataset comprising 1381 subjects from IXI, CC359, OASIS, and MiLI (the latter created specifically for this benchmark). All data were provided with automatically generated hypothalamic masks, and a subset contains manually annotated masks. As a baseline, a method for fully automated segmentation of the hypothalamus on T1-weighted MR images with greater generalization ability is presented. The proposed method is a teacher-student-based model with two blocks: segmentation and correction, where the second corrects the imperfections of the first block. After using three datasets for training (MiLI, IXI, and CC359), the prediction performance of the model was measured on two test sets: the first was composed of data from IXI, CC359, and MiLI, achieving a Dice coefficient of 0.83; the second was from OASIS, a dataset not used for training, achieving a Dice coefficient of 0.74. The dataset, the baseline model, and all necessary code to reproduce the experiments are available at https://github.com/MICLab-Unicamp/HypAST and https://sites.google.com/view/calgary-campinas-dataset/hypothalamus-benchmarking. In addition, a leaderboard will be maintained with predictions for the test set submitted by anyone working on the same task.

URL: https://github.com/MICLab-Unicamp/HypAST and https://sites.google.com/view/calgary-campinas-dataset/hypothalamus-benchmarking

Reproducible evaluation of classification methods in Alzheimer’s disease: Framework and application to MRI and PET data.

A large number of papers have introduced novel machine learning and feature extraction methods for automatic classification of Alzheimer’s disease (AD). However, while the vast majority of these works use the public dataset ADNI for evaluation, they are difficult to reproduce because different key components of the validation are often not readily available. These components include selected participants and input data, image preprocessing and cross-validation procedures. The performance of the different approaches is also difficult to compare objectively. In particular, it is often difficult to assess which part of the method (e.g. preprocessing, feature extraction or classification algorithms) provides a real improvement, if any. In the present paper, we propose a framework for reproducible and objective classification experiments in AD using three publicly available datasets (ADNI, AIBL and OASIS). The framework comprises: i) automatic conversion of the three datasets into a standard format (BIDS); ii) a modular set of preprocessing pipelines, feature extraction and classification methods, together with an evaluation framework, that provide a baseline for benchmarking the different components. We demonstrate the use of the framework for a large-scale evaluation on 1960 participants using T1 MRI and FDG PET data. In this evaluation, we assess the influence of different modalities, preprocessing, feature types (regional or voxel-based features), classifiers, training set sizes and datasets. Performances were in line with the state-of-the-art. FDG PET outperformed T1 MRI for all classification tasks. No difference in performance was found for the use of different atlases, image smoothing, partial volume correction of FDG PET images, or feature type. Linear SVM and L2-logistic regression resulted in similar performance and both outperformed random forests. The classification performance increased along with the number of subjects used for training. Classifiers trained on ADNI generalized well to AIBL and OASIS. All the code of the framework and the experiments is publicly available: general-purpose tools have been integrated into the Clinica software (www.clinica.run) and the paper-specific code is available at: https://gitlab.icm-institute.org/aramislab/AD-ML.

URL: https://gitlab.icm-institute.org/aramislab/AD-ML

Automated recognition and analysis of body bending behavior in C. elegans.

BACKGROUND: Locomotion behaviors of Caenorhabditis elegans play an important role in drug activity screening, anti-aging research, and toxicological assessment. Previous studies have provided important insights into drug activity screening, anti-aging, and toxicological research by manually counting the number of body bends. However, manual counting is often low-throughput and takes a lot of time and manpower. It is also prone to introducing bias and error into the counting results. RESULTS: In this paper, an algorithm is proposed for automatic counting and analysis of the body bending behavior of nematodes. First, the numerical coordinate regression method with a convolutional neural network is used to obtain the head and tail coordinates. Next, a curvature-based feature point extraction algorithm is used to calculate the feature points of the nematode centerline. Then the maximum distance between the peak point and the straight line between the pharynx and the tail is calculated. The number of body bends is counted according to the change in the maximum distance per frame. CONCLUSION: Experiments are performed to prove the effectiveness of the proposed algorithm. The accuracy of head coordinate prediction is 0.993, and the accuracy of tail coordinate prediction is 0.990. The Pearson correlation coefficient between the results of the automatic count and manual count of the number of body bends is 0.998 and the mean absolute error is 1.931. Different strains of nematodes are selected to analyze differences in body bending behavior, demonstrating a relationship between nematode vitality and lifespan. The code is freely available at https://github.com/hthana/Body-Bend-Count.

URL: https://github.com/hthana/Body-Bend-Count
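
The geometric counting step described above (maximum distance between the centerline and the head-tail chord, tracked across frames) can be sketched as follows; the helpers and the counting rule are simplified assumptions, not the released pipeline:

    # Hedged sketch: signed distance from the centerline to the head-tail line,
    # and a simple oscillation-based bend count over frames.
    import numpy as np

    def max_signed_distance(centerline, head, tail):
        d = tail - head
        n = np.array([-d[1], d[0]]) / np.linalg.norm(d)   # unit normal to the head-tail line
        dist = (centerline - head) @ n                    # signed distances of all centerline points
        return dist[np.argmax(np.abs(dist))]              # keep the sign of the largest deviation

    def count_bends(per_frame_distance, threshold=2.0):
        # Count swings past the threshold on alternating sides (a hypothetical rule;
        # the paper's exact counting criterion may differ).
        signs = np.sign(per_frame_distance) * (np.abs(per_frame_distance) > threshold)
        signs = signs[signs != 0]
        return int(np.sum(np.diff(signs) != 0))

    frames = [np.c_[np.linspace(0, 10, 20), 3 * np.sin(np.linspace(0, np.pi, 20) + t)]
              for t in np.linspace(0, 4 * np.pi, 60)]     # synthetic wiggling centerlines
    dists = np.array([max_signed_distance(f, f[0], f[-1]) for f in frames])
    print(count_bends(dists))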

MUTATE: a human genetic atlas of multiorgan artificial intelligence endophenotypes using genome-wide association summary statistics.

Artificial intelligence (AI) has been increasingly integrated into imaging genetics to provide intermediate phenotypes (i.e. endophenotypes) that bridge the genetics and clinical manifestations of human disease. However, the genetic architecture of these AI endophenotypes remains largely unexplored in the context of human multiorgan system diseases. Using publicly available genome-wide association study summary statistics from the UK Biobank (UKBB), FinnGen, and the Psychiatric Genomics Consortium, we comprehensively depicted the genetic architecture of 2024 multiorgan AI endophenotypes (MAEs). We comparatively assessed the single-nucleotide polymorphism-based heritability, polygenicity, and natural selection signatures of 2024 MAEs using methods commonly used in the field. Genetic correlation and Mendelian randomization analyses reveal both within-organ relationships and cross-organ interconnections. Bi-directional causal relationships were established between chronic human diseases and MAEs across multiple organ systems, including Alzheimer’s disease for the brain, diabetes for the metabolic system, asthma for the pulmonary system, and hypertension for the cardiovascular system. Finally, we derived polygenic risk scores for the 2024 MAEs for individuals not used to calculate MAEs and returned these to the UKBB. Our findings underscore the promise of the MAEs as new instruments to ameliorate overall human health. All results are encapsulated into the MUlTiorgan AI endophenoTypE genetic atlas and are publicly available at https://labs-laboratory.com/mutate.

URL: https://labs-laboratory.com/mutate

AITeQ: a machine learning framework for Alzheimer’s prediction using a distinctive five-gene signature.

Neurodegenerative diseases, such as Alzheimer’s disease, pose a significant global health challenge with their complex etiology and elusive biomarkers. In this study, we developed the Alzheimer’s Identification Tool (AITeQ), a machine learning (ML) model based on an optimized ensemble algorithm for the identification of Alzheimer’s from RNA sequencing (RNA-seq) data. Analysis of RNA-seq data from several studies identified 87 differentially expressed genes. This was followed by an ML protocol involving feature selection, model training, performance evaluation, and hyperparameter tuning. The feature selection process undertaken in this study, employing a combination of four different methodologies, culminated in the identification of a compact yet impactful set of five genes. Twelve diverse ML models were trained and tested using these five genes (CNKSR1, EPHA2, CLSPN, OLFML3, and TARBP1). Performance metrics, including precision, recall, F1 score, accuracy, Matthews correlation coefficient, and receiver operating characteristic area under the curve, were assessed for the finally selected model. Overall, the ensemble model consisting of logistic regression, naive Bayes classifier, and support vector machine with optimized hyperparameters was identified as the best and was used to develop AITeQ. AITeQ is available at: https://github.com/ishtiaque-ahammad/AITeQ.

URL: https://github.com/ishtiaque-ahammad/AITeQ
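
As a rough illustration of the final ensemble described above (logistic regression, naive Bayes and SVM over the five-gene signature), here is a generic scikit-learn sketch with placeholder expression data, not the AITeQ pipeline:

    # Hedged sketch: soft-voting ensemble over a five-gene feature matrix.
    import numpy as np
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    genes = ["CNKSR1", "EPHA2", "CLSPN", "OLFML3", "TARBP1"]
    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, len(genes)))   # placeholder expression values
    y = rng.integers(0, 2, size=120)         # placeholder AD / control labels

    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("nb", GaussianNB()),
            ("svm", SVC(probability=True)),  # probability=True enables soft voting
        ],
        voting="soft",
    )
    print(cross_val_score(ensemble, X, y, cv=5).mean())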

Alzheimer’s disease diagnosis from multi-modal data via feature inductive learning and dual multilevel graph neural network.

Multi-modal data can provide complementary information of Alzheimer’s disease (AD) and its development from different perspectives. Such information is closely related to the diagnosis, prevention, and treatment of AD, and hence it is necessary and critical to study AD through multi-modal data. Existing learning methods, however, usually ignore the influence of feature heterogeneity and directly fuse features in the last stages. Furthermore, most of these methods only focus on local fusion features or global fusion features, neglecting the complementariness of features at different levels and thus not sufficiently leveraging information embedded in multi-modal data. To overcome these shortcomings, we propose a novel framework for AD diagnosis that fuses gene, imaging, protein, and clinical data. Our framework learns feature representations under the same feature space for different modalities through a feature induction learning (FIL) module, thereby alleviating the impact of feature heterogeneity. Furthermore, in our framework, local and global salient multi-modal feature interaction information at different levels is extracted through a novel dual multilevel graph neural network (DMGNN). We extensively validate the proposed method on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset and experimental results demonstrate our method consistently outperforms other state-of-the-art multi-modal fusion methods. The code is publicly available on the GitHub website. (https://github.com/xiankantingqianxue/MIA-code.git).

URL: https://github.com/xiankantingqianxue/MIA-code.git

UnCOT-AD: Unpaired Cross-Omics Translation Enables Multi-Omics Integration for Alzheimer’s Disease Prediction.

Alzheimer’s Disease (AD) is a progressive neurodegenerative disorder, posing a growing public health challenge. Traditional machine learning models for AD prediction have relied on single omics data or phenotypic assessments, limiting their ability to capture the disease’s molecular complexity and resulting in poor performance. Recent advances in high-throughput multi-omics have provided deeper biological insights. However, due to the scarcity of paired omics datasets, existing multi-omics AD prediction models rely on unpaired omics data, where different omics profiles are combined without being derived from the same biological sample, leading to biologically less meaningful pairings and causing less accurate predictions. To address these issues, we propose UnCOT-AD, a novel deep learning framework for Unpaired Cross-Omics Translation enabling effective multi-omics integration for AD prediction. Our method introduces the first-ever cross-omics translation model trained on unpaired omics datasets, using two coupled Variational Autoencoders and a novel cycle consistency mechanism to ensure accurate bidirectional translation between omics types. We integrate adversarial training to ensure that the generated omics profiles are biologically realistic. Moreover, we employ contrastive learning to capture the disease specific patterns in latent space to make the cross-omics translation more accurate and biologically relevant. We rigorously validate UnCOT-AD on both cross-omics translation and AD prediction tasks. Results show that UnCOT-AD empowers multi-omics based AD prediction by combining real omics profiles with corresponding omics profiles generated by our cross-omics translation module and achieves state-of-the-art performance in accuracy and robustness. Source code is available at https://github.com/abrarrahmanabir/UnCOT-AD.

URL: https://github.com/abrarrahmanabir/UnCOT-AD
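
The cycle-consistency idea for unpaired cross-omics translation can be sketched schematically; the plain MLP translators and dimensions below are assumptions for illustration, not the UnCOT-AD model:

    # Hedged PyTorch sketch: translate omics A -> B and B -> A with two networks and
    # penalize the round trips A -> B -> A and B -> A -> B for unpaired batches.
    import torch
    import torch.nn as nn

    def mlp(d_in, d_out):
        return nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_out))

    dim_a, dim_b = 200, 150                  # hypothetical feature sizes of the two omics types
    a_to_b, b_to_a = mlp(dim_a, dim_b), mlp(dim_b, dim_a)
    l1 = nn.L1Loss()

    x_a = torch.randn(16, dim_a)             # unpaired batch from omics A
    x_b = torch.randn(16, dim_b)             # unpaired batch from omics B

    fake_b, fake_a = a_to_b(x_a), b_to_a(x_b)
    cycle_loss = l1(b_to_a(fake_b), x_a) + l1(a_to_b(fake_a), x_b)
    # In the full model this term is combined with adversarial and contrastive losses.
    cycle_loss.backward()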

CE-GAN: Community Evolutionary Generative Adversarial Network for Alzheimer’s Disease Risk Prediction.

In the studies of neurodegenerative diseases such as Alzheimer’s Disease (AD), researchers often focus on the associations among multi-omics pathogeny based on imaging genetics data. However, current studies overlook the communities in brain networks, leading to inaccurate models of disease development. This paper explores the developmental patterns of AD from the perspective of community evolution. We first establish a mathematical model to describe functional degeneration in the brain as the community evolution driven by entropy information propagation. Next, we propose an interpretable Community Evolutionary Generative Adversarial Network (CE-GAN) to predict disease risk. In the generator of CE-GAN, community evolutionary convolutions are designed to capture the evolutionary patterns of AD. The experiments are conducted using functional magnetic resonance imaging (fMRI) data and single nucleotide polymorphism (SNP) data. CE-GAN achieves 91.67% accuracy and 91.83% area under curve (AUC) in AD risk prediction tasks, surpassing advanced methods on the same dataset. In addition, we validated the effectiveness of CE-GAN for pathogeny extraction. The source code of this work is available at https://github.com/fmri123456/CE-GAN.

URL: https://github.com/fmri123456/CE-GAN

Integrative analysis of multi-omics and imaging data with incorporation of biological information via structural Bayesian factor analysis.

MOTIVATION: With the rapid development of modern technologies, massive data are available for the systematic study of Alzheimer’s disease (AD). Though many existing AD studies mainly focus on single-modality omics data, multi-omics datasets can provide a more comprehensive understanding of AD. To bridge this gap, we proposed a novel structural Bayesian factor analysis framework (SBFA) to extract the information shared by multi-omics data through the aggregation of genotyping data, gene expression data, neuroimaging phenotypes and prior biological network knowledge. Our approach can extract common information shared by different modalities and encourage biologically related features to be selected, guiding future AD research in a biologically meaningful way. METHOD: Our SBFA model decomposes the mean parameters of the data into a sparse factor loading matrix and a factor matrix, where the factor matrix represents the common information extracted from multi-omics and imaging data. Our framework is designed to incorporate prior biological network information. Our simulation study demonstrated that our proposed SBFA framework could achieve the best performance compared with the other state-of-the-art factor-analysis-based integrative analysis methods. RESULTS: We apply our proposed SBFA model together with several state-of-the-art factor analysis models to extract the latent common information from genotyping, gene expression and brain imaging data simultaneously from the ADNI biobank database. The latent information is then used to predict the functional activities questionnaire score, an important measurement for diagnosis of AD quantifying subjects’ abilities in daily life. Our SBFA model shows the best prediction performance compared with the other factor analysis models. AVAILABILITY: Code is publicly available at https://github.com/JingxuanBao/SBFA. CONTACT: qlong@upenn.edu.

URL: https://github.com/JingxuanBao/SBFA

Multi-task prediction-based graph contrastive learning for inferring the relationship among lncRNAs, miRNAs and diseases.

MOTIVATION: Identifying the relationships among long non-coding RNAs (lncRNAs), microRNAs (miRNAs) and diseases is highly valuable for diagnosing, preventing, treating and prognosing diseases. The development of effective computational prediction methods can reduce experimental costs. While numerous methods have been proposed, they often treat the prediction of lncRNA-disease associations (LDAs), miRNA-disease associations (MDAs) and lncRNA-miRNA interactions (LMIs) as separate tasks. Models capable of predicting all three relationships simultaneously remain relatively scarce. Our aim is to perform multi-task predictions, which not only constructs a unified framework, but also facilitates mutual complementarity of information among lncRNAs, miRNAs and diseases. RESULTS: In this work, we propose a novel unsupervised embedding method called graph contrastive learning for multi-task prediction (GCLMTP). Our approach aims to predict LDAs, MDAs and LMIs by simultaneously extracting embedding representations of lncRNAs, miRNAs and diseases. To achieve this, we first construct a triple-layer lncRNA-miRNA-disease heterogeneous graph (LMDHG) that integrates the complex relationships between these entities based on their similarities and correlations. Next, we employ an unsupervised embedding model based on graph contrastive learning to extract potential topological features of lncRNAs, miRNAs and diseases from the LMDHG. The graph contrastive learning leverages graph convolutional network architectures to maximize the mutual information between patch representations and corresponding high-level summaries of the LMDHG. Subsequently, for the three prediction tasks, multiple classifiers are explored to predict LDA, MDA and LMI scores. Comprehensive experiments are conducted on two datasets (from older and newer versions of the database, respectively). The results show that GCLMTP outperforms other state-of-the-art methods for the disease-related lncRNA and miRNA prediction tasks. Additionally, case studies on two datasets further demonstrate the ability of GCLMTP to accurately discover new associations. To ensure reproducibility of this work, we have made the datasets and source code publicly available at https://github.com/sheng-n/GCLMTP.

URL: https://github.com/sheng-n/GCLMTP

DeepPerVar: a multi-modal deep learning framework for functional interpretation of genetic variants in personal genome.

MOTIVATION: Understanding the functional consequence of genetic variants, especially the non-coding ones, is important but particularly challenging. Genome-wide association studies (GWAS) or quantitative trait locus analyses may be subject to limited statistical power and linkage disequilibrium, and thus are less optimal to pinpoint the causal variants. Moreover, most existing machine-learning approaches, which exploit the functional annotations to interpret and prioritize putative causal variants, cannot accommodate the heterogeneity of personal genetic variations and traits in a population study targeting a specific disease. RESULTS: By leveraging paired whole-genome sequencing data and epigenetic functional assays in a population study, we propose a multi-modal deep learning framework to predict genome-wide quantitative epigenetic signals by considering both personal genetic variations and traits. The proposed approach can further evaluate the functional consequence of non-coding variants on an individual level by quantifying the allelic difference of predicted epigenetic signals. By applying the approach to the ROSMAP cohort studying Alzheimer’s disease (AD), we demonstrate that the proposed approach can accurately predict quantitative genome-wide epigenetic signals and, in key genomic regions of AD causal genes, learn canonical motifs reported to regulate gene expression of AD causal genes, improve the partitioning heritability analysis and prioritize putative causal variants in a GWAS risk locus. Finally, we release the proposed deep learning model as a stand-alone Python toolkit and a web server. AVAILABILITY AND IMPLEMENTATION: https://github.com/lichen-lab/DeepPerVar. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: https://github.com/lichen-lab/DeepPerVar

Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes.

MOTIVATION: Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forest (RF) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarity measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. RESULTS: We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer’s disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated with this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity. AVAILABILITY: The Java code is freely available at http://www2.imperial.ac.uk/~gmontana.

URL: http://www2.imperial.ac.uk/~gmontana

Debiased machine learning for ultra-high dimensional mediation analysis.

MOTIVATION: In ultra-high dimensional mediation analysis, confounding variables can influence both mediators and outcomes through complex functional forms. While machine learning (ML) approaches are effective at modeling such complex relationships, they can introduce bias when estimating mediation effects. In this paper, we propose a debiased ML framework that mitigates this bias, enabling accurate identification of key mediators and precise estimation and inference of their respective contributions. RESULTS: We construct an orthogonalized score function and employ cross-fitting to reduce bias introduced by ML. To tackle ultra-high dimensional potential mediators, we implement screening and regularization techniques for variable selection and effect estimation. For statistical inference of the mediators’ contributions, we use an adjusted Sobel-type (ASobel) test. Simulation results demonstrate the superior performance of the proposed method in handling complex confounding. Applying this method to Alzheimer’s Disease Neuroimaging Initiative (ADNI) data, we identify several cytosine-phosphate-guanine (CpG) sites where DNA methylation (DNAm) mediates the effect of body mass index (BMI) on Alzheimer’s Disease (AD). AVAILABILITY AND IMPLEMENTATION: The R function DML_HDMA implementing the proposed methods is available online at https://github.com/Wei-Kecheng/DML_HDMA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: https://github.com/Wei-Kecheng/DML_HDMA
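
The orthogonalized-score-plus-cross-fitting idea can be illustrated on the simpler problem of estimating a single exposure effect; this is a generic debiased-ML sketch with simulated data, not the DML_HDMA mediation procedure:

    # Hedged sketch: fit nuisance models on one fold, residualize on the other,
    # and estimate the target coefficient from the cross-fitted residuals.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    n = 400
    X = rng.normal(size=(n, 10))                  # confounders
    d = X[:, 0] + rng.normal(size=n)              # exposure (e.g., BMI), confounded by X
    y = 0.5 * d + X[:, 1] + rng.normal(size=n)    # outcome; the true effect of d is 0.5

    res_y, res_d = np.zeros(n), np.zeros(n)
    for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
        m_y = RandomForestRegressor(random_state=0).fit(X[train], y[train])
        m_d = RandomForestRegressor(random_state=0).fit(X[train], d[train])
        res_y[test] = y[test] - m_y.predict(X[test])
        res_d[test] = d[test] - m_d.predict(X[test])

    theta = res_d @ res_y / (res_d @ res_d)       # orthogonalized (partialling-out) estimate
    print(theta)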

Morbigenous brain region and gene detection with a genetically evolved random neural network cluster approach in late mild cognitive impairment.

MOTIVATION: Multimodal data fusion analysis is becoming another important field for brain disease detection, and an increasing number of studies concentrate on using neural network algorithms to solve a range of problems. However, most current neural network optimizing strategies focus on internal nodes or hidden layer numbers, while ignoring the advantages of external optimization. Additionally, in the multimodal data fusion analysis of brain science, the problems of small sample size and high-dimensional data are often encountered due to the difficulty of data collection and the specialization of brain science data, which may result in lower generalization performance of neural networks. RESULTS: We propose a genetically evolved random neural network cluster (GERNNC) model. Specifically, the fusion characteristics are first constructed to be taken as the input and the best type of neural network is selected as the base classifier to form the initial random neural network cluster. Second, the cluster is adaptively genetically evolved. Based on the GERNNC model, we further construct a multi-tasking framework for the classification of patients with brain disease and the extraction of significant characteristics. In a study of genetic data and functional magnetic resonance imaging data from the Alzheimer’s Disease Neuroimaging Initiative, the framework exhibits great classification performance and strong morbigenous factor detection ability. This work demonstrates how to effectively detect pathogenic components of brain disease from high-dimensional medical data with small samples. AVAILABILITY AND IMPLEMENTATION: The Matlab code is available at https://github.com/lizi1234560/GERNNC.git.

URL: https://github.com/lizi1234560/GERNNC.git

Coupled mixed model for joint genetic analysis of complex disorders with two independently collected data sets.

BACKGROUND: In the last decade, genome-wide association studies (GWASs) have contributed to decoding the human genome by uncovering many genetic variations associated with various diseases. Many follow-up investigations involve joint analysis of multiple independently generated GWAS data sets. While most of the computational approaches developed for joint analysis are based on summary statistics, joint analysis based on individual-level data with consideration of confounding factors remains a challenge. RESULTS: In this study, we propose a method, called Coupled Mixed Model (CMM), that enables a joint GWAS analysis on two independently collected sets of GWAS data with different phenotypes. The CMM method does not require the data sets to have the same phenotypes as it aims to infer the unknown phenotypes using a set of multivariate sparse mixed models. Moreover, CMM addresses the confounding variables due to population stratification, family structures, and cryptic relatedness, as well as those arising during data collection such as batch effects that frequently appear in joint genetic studies. We evaluate the performance of CMM using simulation experiments. In real data analysis, we illustrate the utility of CMM by an application to evaluating common genetic associations for Alzheimer’s disease and substance use disorder using datasets independently collected for the two complex human disorders. Comparison of the results with those from previous experiments and analyses supports the utility of our method and provides new insights into the diseases. The software is available at https://github.com/HaohanWang/CMM.

URL: https://github.com/HaohanWang/CMM

Machine learning-based quantification for disease uncertainty increases the statistical power of genetic association studies.

MOTIVATION: Allowing for increasingly large samples is key to identifying the association of genetic variants with Alzheimer’s disease (AD) in genome-wide association studies (GWAS). Accordingly, we aimed to develop a method that incorporates patients with mild cognitive impairment (MCI) and unknown cognitive status in GWAS using a machine learning-based AD prediction model. RESULTS: Simulation analyses showed that the weighting imputed phenotypes (WIP) method increased the statistical power compared to ordinary logistic regression using only AD cases and controls. Applied to real-world data, the penalized logistic method had the highest AUC (0.96) for AD prediction, and the WIP method performed well in terms of power. We identified an association (p < 5.0 × 10⁻⁸) of AD with several variants in the APOE region and with rs143625563 in LMX1A. Our method, which allows the inclusion of individuals with MCI, improves the statistical power of GWAS for AD. We discovered a novel association with LMX1A. AVAILABILITY AND IMPLEMENTATION: Simulation code can be accessed at https://github.com/Junkkkk/wGEE_GWAS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: https://github.com/Junkkkk/wGEE_GWAS
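
The weighting idea can be sketched for a single variant, assuming a model-predicted AD probability is available for each individual with uncertain status; this is a generic illustration, not the wGEE_GWAS implementation:

    # Hedged sketch: each uncertain individual enters twice, once as a pseudo-case
    # weighted by the predicted AD probability and once as a pseudo-control weighted
    # by its complement, in a weighted logistic association test.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 500
    genotype = rng.integers(0, 3, size=n).astype(float)        # 0/1/2 minor allele counts
    p_case = 1.0 / (1.0 + np.exp(-(0.4 * genotype - 0.3)))      # stand-in for the ML-predicted risk

    y = np.r_[np.ones(n), np.zeros(n)]                          # pseudo-case / pseudo-control labels
    g = np.r_[genotype, genotype].reshape(-1, 1)
    w = np.r_[p_case, 1.0 - p_case]                             # weights from the prediction model

    model = LogisticRegression().fit(g, y, sample_weight=w)
    print(model.coef_[0][0])                                    # per-allele log-odds estimate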

Learning directed acyclic graphical structures with genetical genomics data.

MOTIVATION: A large amount of research effort has been focused on estimating gene networks based on gene expression data to understand the functional basis of a living organism. Such networks are often obtained by considering pairwise correlations between genes, and thus may not reflect the true connectivity between genes. By treating gene expressions as quantitative traits while considering genetic markers, genetical genomics analysis has shown its power in enhancing the understanding of gene regulation. Previous works have shown improved performance in estimating the undirected network graphical structure by incorporating genetic markers as covariates. Knowing that gene expressions are often due to directed regulations, it is more meaningful to estimate the directed graphical network. RESULTS: In this article, we introduce a covariate-adjusted Gaussian graphical model to estimate the Markov equivalence class of the directed acyclic graphs (DAGs) in a genetical genomics analysis framework. We develop a two-stage estimation procedure to first estimate the regression coefficient matrix by l1 penalization. The estimated coefficient matrix is then used to estimate the mean values in our multi-response Gaussian model to estimate the regulatory networks of gene expressions using the PC algorithm. The estimation consistency for high-dimensional sparse DAGs is established. Simulations are conducted to demonstrate our theoretical results. The method is applied to a human Alzheimer’s disease dataset in which differential DAGs are identified between cases and controls. AVAILABILITY AND IMPLEMENTATION: R code for implementing the method can be downloaded at http://www.stt.msu.edu/~cui and is freely available at http://www.stt.msu.edu/~cui/software.html.

URL: http://www.stt.msu.edu/~cui and http://www.stt.msu.edu/~cui/software.html

Deconfounded and debiased estimation for high-dimensional linear regression under hidden confounding with application to omics data.

MOTIVATION: A critical challenge in observational studies arises from the presence of hidden confounders in high-dimensional data, which biases causal effect estimation through both hidden confounding and high-dimensional estimation error. Classical deconfounding methods are often inadequate in high-dimensional settings and typically require prior information on the hidden confounders. We propose a two-step deconfounded and debiased estimation procedure for high-dimensional linear regression under hidden confounding. RESULTS: First, we reduce hidden confounding via a spectral transformation. Second, we correct the bias from the weighted l1 penalty, commonly used in high-dimensional estimation, by inverting the Karush-Kuhn-Tucker conditions and solving convex optimization programs. The spectral-transformation deconfounding technique requires no prior knowledge of hidden confounders. The novel debiasing approach improves over recent work by not assuming a sparse precision matrix, making it more suitable for cases with intrinsic covariate correlations. Simulations show that the proposed method corrects both biases and provides more precise coefficient estimates than existing approaches. We also apply the proposed method to a DNA methylation dataset from the Alzheimer’s Disease Neuroimaging Initiative database to investigate the association between cerebrospinal fluid tau protein levels and Alzheimer’s disease (AD) severity. AVAILABILITY: The code for the proposed method is available on GitHub (https://github.com/Li-Zhaoy/Dec-Deb.git) and archived on Zenodo (DOI: 10.5281/zenodo.15478745). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/Li-Zhaoy/Dec-Deb.git’]
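
The spectral deconfounding step can be illustrated with a "trim" transform that caps the leading singular values of the design matrix before an l1-penalized regression; the KKT-inversion debiasing step is omitted. This is a hedged sketch of the general idea, not the authors' code, and the choice of capping at the median singular value is an assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 1.0
confounder = rng.standard_normal(n)                      # hidden confounder
X += np.outer(confounder, rng.standard_normal(p)) * 0.5  # confounder affects covariates
y = X @ beta + 2.0 * confounder + rng.standard_normal(n)

# Trim transform: cap singular values of X at their median, F = U diag(min(d, tau)/d) U^T
U, d, _ = np.linalg.svd(X, full_matrices=False)
tau = np.median(d)
F = U @ np.diag(np.minimum(d, tau) / d) @ U.T

# l1-penalized regression on the transformed data (deconfounded lasso)
lasso = Lasso(alpha=0.1).fit(F @ X, F @ y)
print("selected coefficients:", np.flatnonzero(lasso.coef_ != 0)[:10])
```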

Discovering network phenotype between genetic risk factors and disease status via diagnosis-aligned multi-modality regression method in Alzheimer’s disease.

MOTIVATION: Neuroimaging genetics is an emerging field that identifies associations between genetic variants [e.g. single-nucleotide polymorphisms (SNPs)] and quantitative traits (QTs) such as brain imaging phenotypes. However, most current studies focus only on associations between structural brain imaging and genetic variants, while neglecting connectivity information between brain regions. In addition, the brain itself is a complex network, and higher-order interactions may contain useful information for the mechanistic understanding of diseases such as Alzheimer’s disease (AD). RESULTS: A general framework is proposed to exploit network voxel information and network connectivity information as intermediate traits that bridge genetic risk factors and disease status. Specifically, we first use a sparse representation (SR) model to build a hyper-network that expresses the connectivity features of the brain. The network voxel node features and network connectivity edge features are extracted from structural magnetic resonance imaging (sMRI) and resting-state functional magnetic resonance imaging (fMRI), respectively. Second, a diagnosis-aligned multi-modality regression method is adopted to fully explore the relationships among modalities across subjects, which helps further mine the relation between genetic risk factors and brain network features. In experiments, all methods are tested on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. The experimental results not only verify the effectiveness of our proposed framework but also uncover brain regions and connectivity features that are highly related to disease. AVAILABILITY AND IMPLEMENTATION: The Matlab code is available at http://ibrain.nuaa.edu.cn/2018/list.htm.

URL: [‘http://ibrain.nuaa.edu.cn/2018/list.htm’]

A gene-level methylome-wide association analysis identifies novel Alzheimer’s disease genes.

MOTIVATION: Transcriptome-wide association studies (TWAS) have successfully facilitated the discovery of novel genetic risk loci for many complex traits, including late-onset Alzheimer’s disease (AD). However, most existing TWAS methods rely only on gene expression and ignore epigenetic modification (i.e., DNA methylation) and functional regulatory information (i.e., enhancer-promoter interactions), both of which contribute significantly to the genetic basis of AD. RESULTS: We develop a novel gene-level association testing method that integrates genetically regulated DNA methylation and enhancer-target gene pairs with genome-wide association study (GWAS) summary results. Through simulations, we show that our approach, referred to as the CMO (cross methylome omnibus) test, yielded well-controlled type I error rates and achieved much higher statistical power than competing methods under a wide range of scenarios. Furthermore, compared with TWAS, CMO identified an average of 124% more associations when analyzing several brain imaging-related GWAS results. By analyzing the largest AD GWAS to date, comprising 71,880 cases and 383,378 controls, CMO identified six novel loci for AD that were missed by competing methods. AVAILABILITY: Software: https://github.com/ChongWuLab/CMO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/ChongWuLab/CMO’]
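
Gene-level omnibus tests of this kind aggregate p-values from several correlated analyses (e.g. different methylation-based weightings of the same gene) into a single gene-level p-value. The Cauchy combination rule below is one widely used choice that tolerates correlation; it is shown purely as an illustration and is not necessarily the rule CMO implements.

```python
import numpy as np

def cauchy_combination(pvals, weights=None):
    """Combine possibly correlated p-values into one omnibus p-value
    (ACAT-style Cauchy combination)."""
    p = np.asarray(pvals, dtype=float)
    w = np.full(p.size, 1.0 / p.size) if weights is None else np.asarray(weights, float)
    t = np.sum(w * np.tan((0.5 - p) * np.pi))       # Cauchy-transformed statistic
    return 0.5 - np.arctan(t / w.sum()) / np.pi     # back-transform to a p-value

# e.g. p-values for one gene from several methylation-based association tests
print(cauchy_combination([0.03, 0.20, 0.0008, 0.45]))
```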

Bayesian GWAS with Structured and Non-Local Priors.

MOTIVATION: The flexibility of a Bayesian framework is promising for GWAS, but current approaches can benefit from more informative prior models. We introduce a novel Bayesian approach to GWAS, called Structured and Non-Local Priors (SNLPs) GWAS, that improves over existing methods in two important ways. First, we describe a model that allows a marker’s gene-parent membership and other characteristics to influence its probability of association with an outcome. Second, we describe a non-local alternative model for differential minor allele rates at each marker, in which the null and alternative hypotheses have no common support. RESULTS: We employ a non-parametric model that allows for clustering of the genes in tandem with a regression model for marker-level covariates, and demonstrate how incorporating these additional characteristics can improve power. We further demonstrate that our non-local alternative model gives symmetric rates of convergence for the null and alternative hypotheses, whereas commonly used local alternative models have asymptotic rates that favor the alternative hypothesis over the null. We demonstrate the robustness and flexibility of our structured and non-local model under different data-generating scenarios and signal-to-noise ratios. We apply our Bayesian GWAS method to single nucleotide polymorphism data collected from a pool of Alzheimer’s disease patients and cognitively normal controls from the Alzheimer’s Disease Neuroimaging Initiative. AVAILABILITY AND IMPLEMENTATION: R code to perform the SNLPs method is available at https://github.com/lockEF/BayesianScreening.

URL: [‘https://github.com/lockEF/BayesianScreening’]

Identifying disease sensitive and quantitative trait-relevant biomarkers from multidimensional heterogeneous imaging genetics data via sparse multimodal multitask learning.

MOTIVATION: Recent advances in brain imaging and high-throughput genotyping techniques enable new approaches to study the influence of genetic and anatomical variations on brain functions and disorders. Traditional association studies typically perform independent, pairwise analyses among neuroimaging measures, cognitive scores and disease status, ignoring the important underlying interacting relationships between these units. RESULTS: To overcome this limitation, in this article, we propose a new sparse multimodal multitask learning method to reveal complex relationships from gene to brain to symptom. Our main contributions are three-fold: (i) introducing combined structured sparsity regularizations into multimodal multitask learning to integrate multidimensional heterogeneous imaging genetics data and identify multimodal biomarkers; (ii) utilizing a joint classification and regression learning model to identify disease-sensitive and cognition-relevant biomarkers; (iii) deriving a new efficient optimization algorithm to solve our non-smooth objective function, with rigorous theoretical analysis of convergence to the global optimum. Using the imaging genetics data from the Alzheimer’s Disease Neuroimaging Initiative database, the effectiveness of the proposed method is demonstrated by clearly improved performance on predicting both cognitive scores and disease status. The identified multimodal biomarkers could predict not only disease status but also cognitive function, helping elucidate the biological pathway from gene to brain structure and function, and on to cognition and disease. AVAILABILITY: Software is publicly available at: http://ranger.uta.edu/%7eheng/multimodal/.

URL: [‘http://ranger.uta.edu/%7eheng/multimodal/’]
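
Structured sparsity of the kind used here is typically optimized with proximal methods, whose workhorse is the row-wise group soft-thresholding operator of the l2,1-norm: it zeroes out an entire feature's coefficients across all tasks at once. The snippet below is a generic sketch of that operator, not the authors' solver.

```python
import numpy as np

def prox_l21(W, lam):
    """Proximal operator of lam * ||W||_{2,1}: shrink each row of W
    (one feature across all tasks) towards zero as a group."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[0.9, -1.1],    # strong feature, kept for both tasks
              [0.05, 0.02],   # weak feature, removed jointly
              [1.5,  0.7]])
print(prox_l21(W, lam=0.2))
```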

Incorporating spatial-anatomical similarity into the VGWAS framework for AD biomarker detection.

MOTIVATION: The detection of potential biomarkers of Alzheimer’s disease (AD) is crucial for its early prediction, diagnosis and treatment. Voxel-wise genome-wide association study (VGWAS) is a commonly used method in imaging genomics, usually applied to detect AD biomarkers in imaging and genetic data. However, existing VGWAS methods entail large computational cost and disregard spatial correlations within imaging data. A novel method is proposed to solve these issues. RESULTS: We introduce a novel method to incorporate spatial correlations into a VGWAS framework for the detection of potential AD biomarkers. To account for the characteristics of AD, we first present a modification of a simple linear iterative clustering method for spatial grouping in an anatomically meaningful manner. Second, we propose a spatial-anatomical similarity matrix to incorporate correlations among voxels. Finally, we detect the potential AD biomarkers from imaging and genetic data by using a fast VGWAS method and test our method on 708 subjects obtained from an Alzheimer’s Disease Neuroimaging Initiative dataset. Results show that our method can successfully detect new risk genes and clusters of AD. The detected imaging and genetic biomarkers are used as predictors to classify AD versus normal control subjects, achieving high classification accuracy. To the best of our knowledge, associations between imaging and genetic data have yet to be systematically exploited in building statistical models for classifying AD subjects, thereby linking imaging genetics and AD. Our method may therefore provide a new way to gain insights into the underlying pathological mechanism of AD. AVAILABILITY AND IMPLEMENTATION: https://github.com/Meiyan88/SASM-VGWAS.

URL: [‘https://github.com/Meiyan88/SASM-VGWAS’]

DEMA: a distance-bounded energy-field minimization algorithm to model and layout biomolecular networks with quantitative features.

SUMMARY: In biology, graph layout algorithms can reveal comprehensive biological contexts by visually positioning graph nodes in their relevant neighborhoods. A layout software algorithm/engine commonly takes a set of nodes and edges and produces layout coordinates of nodes according to edge constraints. However, current layout engines normally do not consider node, edge or node-set properties during layout and only curate these properties after the layout is created. Here, we propose a new layout algorithm, the distance-bounded energy-field minimization algorithm (DEMA), that natively considers various biological factors, such as the strength of gene-to-gene association, a gene’s relative contribution weight and the functional groups of genes, to enhance the interpretation of complex network graphs. In DEMA, we introduce a parameterized energy model where nodes are repelled by the network topology and attracted by a few biological factors, such as the interaction coefficient, effect coefficient and fold change of gene expression. We generalize these factors as gene weights, protein-protein interaction weights, gene-to-gene correlations and gene set annotations, four parameterized functional properties used in DEMA. Moreover, DEMA provides additional attraction, repulsion and grouping coefficients to enable different preferences in generating network views. Applying DEMA, we performed two case studies using genetic data in autism spectrum disorder and Alzheimer’s disease, respectively, for gene candidate discovery. Furthermore, we implement our algorithm as a plugin to Cytoscape, an open-source software platform for visualizing networks, making it convenient to use. Our software and demo can be freely accessed at http://discovery.informatics.uab.edu/dema. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘http://discovery.informatics.uab.edu/dema’]
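
The flavor of an energy-field layout with biological attraction terms can be sketched as gradient descent on a toy energy in which every node pair repels while connected pairs attract with a strength scaled by an edge weight (standing in for an interaction or effect coefficient). This sketch is illustrative only and does not reproduce DEMA's energy model, distance bounds or Cytoscape plugin behavior.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
pos = rng.standard_normal((n, 2))
# weighted edges: (i, j, weight), where the weight could encode interaction strength
edges = [(i, (i + 1) % n, rng.uniform(0.5, 2.0)) for i in range(n)]

for _ in range(500):
    grad = np.zeros_like(pos)
    diff = pos[:, None, :] - pos[None, :, :]
    dist = np.linalg.norm(diff, axis=-1) + 1e-9
    # repulsion between every pair of nodes (gradient of a 1/distance energy)
    grad -= (diff / dist[..., None] ** 3).sum(axis=1)
    # attraction along edges, scaled by the biological weight (spring-like energy)
    for i, j, w in edges:
        grad[i] += w * (pos[i] - pos[j])
        grad[j] += w * (pos[j] - pos[i])
    pos -= 0.01 * grad   # gradient descent step on the total energy

print("first layout coordinates:", pos[:3])
```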

A Bayesian linear mixed model for prediction of complex traits.

MOTIVATION: Accurate disease risk prediction is essential for precision medicine. Existing models assume that diseases are caused either by groups of predictors with small-to-moderate effects or by a few isolated predictors with large effects. Their performance can be sensitive to the underlying disease mechanisms, which are usually unknown in advance. RESULTS: We developed a Bayesian linear mixed model (BLMM), in which genetic effects are modelled using a hybrid of sparse regression and a linear mixed model with multiple random effects. The parameters in BLMM are inferred through a computationally efficient variational Bayes algorithm. The proposed method can resemble the shape of the true effect-size distribution, capture predictive effects from both common and rare variants, and remain robust against various disease models. Through extensive simulations and an application to a whole-genome sequencing dataset obtained from the Alzheimer’s Disease Neuroimaging Initiative, we have demonstrated that BLMM has better prediction performance than existing methods and can detect variables and/or genetic regions that are predictive. AVAILABILITY AND IMPLEMENTATION: The R package is available at https://github.com/yhai943/BLMM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/yhai943/BLMM’]

A U-statistic for integrative analysis of multilayer omics data.

MOTIVATION: The emerging multilayer omics data provide unprecedented opportunities for detecting biomarkers that are associated with complex diseases at various molecular levels. However, the high dimensionality of multi-omics data and the complexity of disease etiologies pose tremendous analytical challenges. RESULTS: We developed a U-statistics-based non-parametric framework for the association analysis of multilayer omics data, in which consensus and permutation-based weighting schemes are developed to account for various types of disease models. Our proposed method is flexible for analyzing different types of outcomes, as it makes no assumptions about their distributions. Moreover, it explicitly accounts for various types of underlying disease models through the weighting schemes and thus provides robust performance across them. Through extensive simulations and an application to a dataset obtained from the Alzheimer’s Disease Neuroimaging Initiative, we demonstrated that our method outperformed the commonly used kernel regression-based methods. AVAILABILITY AND IMPLEMENTATION: The R package is available at https://github.com/YaluWen/Uomic. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/YaluWen/Uomic’]
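
A pairwise-comparison U-statistic for association can be sketched by contrasting subject-pair similarity in an omics layer with subject-pair similarity in the outcome, with significance assessed by permutation. The Gaussian kernel and uniform weighting below are illustrative assumptions and are not the consensus or permutation-based weighting schemes defined in the paper.

```python
import numpy as np

def u_association(X, y, n_perm=999, seed=0):
    """Pairwise-kernel U-statistic: average product of omics similarity and
    phenotype similarity over all subject pairs, with a permutation p-value."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Gaussian kernel similarity between omics profiles
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / np.median(d2[d2 > 0]))
    iu = np.triu_indices(n, 1)

    def stat(yy):
        ysim = -np.abs(yy[:, None] - yy[None, :])   # phenotype similarity (negative distance)
        return (K[iu] * ysim[iu]).mean()

    obs = stat(y)
    null = np.array([stat(rng.permutation(y)) for _ in range(n_perm)])
    return obs, (1 + np.sum(null >= obs)) / (n_perm + 1)

X = np.random.default_rng(3).standard_normal((60, 10))      # one omics layer
y = X[:, 0] + 0.5 * np.random.default_rng(4).standard_normal(60)
print(u_association(X, y))
```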

HyperTMO: a trusted multi-omics integration framework based on hypergraph convolutional network for patient classification.

MOTIVATION: The rapid development of high-throughput biomedical technologies can provide researchers with detailed multi-omics data. Machine learning-based integrative analysis of multi-omics data offers a more comprehensive perspective for human disease research. However, there are still significant challenges in representing single-omics data and integrating multi-omics information. RESULTS: This paper presents HyperTMO, a Trusted Multi-Omics integration framework based on a hypergraph convolutional network for patient classification. HyperTMO constructs hypergraph structures to represent the associations between samples in single-omics data, performs evidence extraction with a hypergraph convolutional network, and integrates multi-omics information at the evidence level. We experimentally demonstrate that HyperTMO outperforms other state-of-the-art methods in breast cancer subtype classification and Alzheimer’s disease classification tasks using multi-omics data from the TCGA (BRCA) and ROSMAP datasets. Importantly, HyperTMO is the first attempt to combine hypergraph structure, evidence theory and multi-omics integration for patient classification. Its accuracy and robustness give it great potential for application in clinical diagnosis. AVAILABILITY: HyperTMO and datasets are publicly available at https://github.com/ippousyuga/HyperTMO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/ippousyuga/HyperTMO’]
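
The evidence-extraction step builds on standard hypergraph convolution, which propagates sample features through hyperedges using the incidence matrix and degree normalizations. A minimal numpy version of that propagation rule is sketched below; HyperTMO's actual layers, evidential outputs and fusion strategy are in the linked repository.

```python
import numpy as np

def hypergraph_conv(X, H, Theta, edge_weights=None):
    """One hypergraph convolution step:
    X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta (Feng et al.-style propagation)."""
    n_nodes, n_edges = H.shape
    W = np.eye(n_edges) if edge_weights is None else np.diag(edge_weights)
    Dv = np.diag(1.0 / np.sqrt(H @ W.diagonal()))   # node degrees
    De = np.diag(1.0 / H.sum(axis=0))               # hyperedge degrees
    return Dv @ H @ W @ De @ H.T @ Dv @ X @ Theta

# 5 samples, 3 hyperedges (e.g. groups of mutually similar samples in one omics layer)
H = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 1, 1], [1, 0, 1]], float)
X = np.random.default_rng(5).standard_normal((5, 4))        # sample features
Theta = np.random.default_rng(6).standard_normal((4, 2))    # learnable weights
print(hypergraph_conv(X, H, Theta).shape)                    # (5, 2)
```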

Metapaths: similarity search in heterogeneous knowledge graphs via meta-paths.

SUMMARY: Heterogeneous knowledge graphs (KGs) have enabled the modeling of complex systems, from genetic interaction graphs and protein-protein interaction networks to networks representing drugs, diseases, proteins, and side effects. Analytical methods for KGs rely on quantifying similarities between entities, such as nodes, in the graph. However, such methods must consider the diversity of node and edge types contained within the KG via, for example, defined sequences of entity types known as meta-paths. We present metapaths, the first R software package to implement meta-paths and perform meta-path-based similarity search in heterogeneous KGs. The metapaths package offers various built-in similarity metrics for node pair comparison by querying KGs represented as either edge or adjacency lists, as well as auxiliary aggregation methods to measure set-level relationships. Indeed, evaluation of these methods on an open-source biomedical KG recovered meaningful drug and disease-associated relationships, including those in Alzheimer’s disease. The metapaths framework facilitates the scalable and flexible modeling of network similarities in KGs with applications across KG learning. AVAILABILITY AND IMPLEMENTATION: The metapaths R package is available via GitHub at https://github.com/ayushnoori/metapaths and is released under MPL 2.0 (Zenodo DOI: 10.5281/zenodo.7047209). Package documentation and usage examples are available at https://www.ayushnoori.com/metapaths.

URL: [‘https://github.com/ayushnoori/metapaths’, ‘https://www.ayushnoori.com/metapaths’]
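
The core of meta-path-based similarity can be illustrated with typed adjacency matrices: the number of paths between two nodes following a meta-path such as Drug-Gene-Disease is an entry of the product of the corresponding adjacency matrices, which can then be normalized into a similarity score. The sketch below (including a PathSim-style normalization) illustrates the idea in Python; it does not reproduce the metapaths R package API or its specific metrics.

```python
import numpy as np

# Toy heterogeneous KG: 3 drugs, 4 genes, 2 diseases
drug_gene = np.array([[1, 1, 0, 0],
                      [0, 1, 1, 0],
                      [0, 0, 1, 1]], float)
gene_disease = np.array([[1, 0],
                         [1, 0],
                         [0, 1],
                         [0, 1]], float)

# Path counts along the meta-path Drug -> Gene -> Disease
path_counts = drug_gene @ gene_disease

# PathSim-style symmetric normalization between drugs along Drug-Gene-Drug
dgd = drug_gene @ drug_gene.T
pathsim = 2 * dgd / (np.diag(dgd)[:, None] + np.diag(dgd)[None, :])

print("drug-disease path counts:\n", path_counts)
print("drug-drug PathSim:\n", np.round(pathsim, 2))
```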

Identifying progressive imaging genetic patterns via multi-task sparse canonical correlation analysis: a longitudinal study of the ADNI cohort.

MOTIVATION: Identifying the genetic basis of brain structure, function and disorder by using imaging quantitative traits (QTs) as endophenotypes is an important task in brain science. Brain QTs often change as the disorder progresses, and thus understanding how genetic factors contribute to these progressive brain QT changes is of great importance. Most existing imaging genetics methods analyze only baseline neuroimaging data, omitting longitudinal imaging data across multiple time points that contain important disease-progression information. RESULTS: We propose a novel temporal imaging genetic model which performs multi-task sparse canonical correlation analysis (T-MTSCCA). Our model uses longitudinal neuroimaging data to uncover how single nucleotide polymorphisms (SNPs) affect brain QTs over time. By incorporating the relationships within the longitudinal imaging data and within the SNPs, T-MTSCCA can identify a trajectory of progressive imaging genetic patterns over time. We propose an efficient algorithm to solve the problem and show its convergence. We evaluate T-MTSCCA on 408 subjects from the Alzheimer’s Disease Neuroimaging Initiative database with longitudinal magnetic resonance imaging data and genetic data available. The experimental results show that T-MTSCCA performs as well as or better than the state-of-the-art methods. In particular, T-MTSCCA identifies higher canonical correlation coefficients and captures clearer canonical weight patterns. This suggests that T-MTSCCA identifies time-consistent and time-dependent SNPs and imaging QTs, which further helps in understanding the genetic basis of brain QT changes during disease progression. AVAILABILITY AND IMPLEMENTATION: The software and simulation data are publicly available at https://github.com/dulei323/TMTSCCA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/dulei323/TMTSCCA’]
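
At its core, sparse canonical correlation analysis alternates between the two weight vectors, each update applying soft-thresholding followed by renormalization. The sketch below shows that inner loop for a single pair of canonical weights on standardized data; it deliberately omits the multi-task and temporal couplings that distinguish T-MTSCCA.

```python
import numpy as np

def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_cca(X, Y, lam_u=0.1, lam_v=0.1, n_iter=100, seed=0):
    """One pair of sparse canonical weights via alternating soft-thresholded updates."""
    rng = np.random.default_rng(seed)
    C = X.T @ Y / len(X)                   # cross-covariance (columns pre-standardized)
    v = rng.standard_normal(Y.shape[1]); v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = soft(C @ v, lam_u);  u /= max(np.linalg.norm(u), 1e-12)
        v = soft(C.T @ u, lam_v); v /= max(np.linalg.norm(v), 1e-12)
    return u, v

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 50))         # e.g. imaging QTs
Y = rng.standard_normal((100, 80))         # e.g. SNPs
u, v = sparse_cca(X, Y)
print("non-zero weights:", (u != 0).sum(), (v != 0).sum())
```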

A network-driven approach for genome-wide association mapping.

MOTIVATION: It remains a challenge to detect associations between genotypes and phenotypes because of insufficient sample sizes and the complex mechanisms underlying such associations. Fortunately, it is becoming more feasible to obtain gene expression data in addition to genotypes and phenotypes, giving us new opportunities to detect true genotype-phenotype associations while unveiling their association mechanisms. RESULTS: In this article, we propose a novel method, NETAM, that accurately detects associations between SNPs and phenotypes, as well as the gene traits involved in such associations. We take a network-driven approach: NETAM first constructs an association network, where nodes represent SNPs, gene traits or phenotypes, and edges represent the strength of association between two nodes. NETAM assigns a score to each path from an SNP to a phenotype, and then identifies significant paths based on the scores. In our simulation study, we show that NETAM finds significantly more phenotype-associated SNPs than traditional genotype-phenotype association analysis under false positive control, taking advantage of gene expression data. Furthermore, we applied NETAM to late-onset Alzheimer’s disease data and identified 477 significant path associations, among which we analyzed paths related to the beta-amyloid, estrogen and nicotine pathways. We also provide hypothetical biological pathways to explain our findings. AVAILABILITY AND IMPLEMENTATION: Software is available at http://www.sailing.cs.cmu.edu/. CONTACT: epxing@cs.cmu.edu.

URL: [‘http://www.sailing.cs.cmu.edu/’]

Regional imaging genetic enrichment analysis.

MOTIVATION: Brain imaging genetics aims to reveal genetic effects on brain phenotypes, where most studies examine phenotypes defined on anatomical or functional regions of interest (ROIs) given their biologically meaningful interpretation and modest dimensionality compared with voxelwise approaches. Typical ROI-level measures used in these studies are summary statistics of the voxelwise measures in a region, which do not make full use of individual voxel signals. RESULTS: In this article, we propose a flexible and powerful framework for mining regional imaging genetic associations via voxelwise enrichment analysis, which embraces the collective effect of weak voxel-level signals and integrates brain anatomical annotation information. Our proposed method achieves three goals at the same time: (i) increase statistical power by substantially reducing the burden of multiple-comparison correction; (ii) employ brain annotation information to enable biologically meaningful interpretation and (iii) make full use of fine-grained voxelwise signals. We demonstrate our method on an imaging genetic analysis using data from the Alzheimer’s Disease Neuroimaging Initiative, where we assess the collective regional genetic effects of voxelwise FDG-positron emission tomography measures between 116 ROIs and 565,373 single-nucleotide polymorphisms. Compared with traditional ROI-wise and voxelwise approaches, our method identified 2946 novel imaging genetic associations in addition to 33 that overlapped with the two benchmark methods. In particular, two newly reported variants were further supported by transcriptomic evidence from region-specific expression analysis. This demonstrates the promise of the proposed method as a flexible and powerful framework for exploring imaging genetic effects on the brain. AVAILABILITY AND IMPLEMENTATION: The R code and sample data are freely available at https://github.com/lshen/RIGEA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/lshen/RIGEA’]

High-throughput and efficient multilocus genome-wide association study on longitudinal outcomes.

MOTIVATION: With the emergence of high-dimensional genomic data, genetic analyses such as genome-wide association studies (GWAS) have played an important role in identifying disease-related genetic variants and novel treatments. Complex longitudinal phenotypes are commonly collected in medical studies; however, because limited analytical approaches are available for longitudinal traits, these data are often underutilized. In this article, we develop a high-throughput machine learning approach for multilocus GWAS using longitudinal traits, coupling Empirical Bayesian Estimates from mixed-effects modeling with a novel l0-norm algorithm. RESULTS: Extensive simulations demonstrated that the proposed approach not only provides accurate selection of single nucleotide polymorphisms (SNPs) with comparable or higher power but also robust control of false positives. More importantly, this novel approach is highly scalable and can be more than 1000 times faster than recently published approaches, making genome-wide multilocus analysis of longitudinal traits possible. In addition, our proposed approach can simultaneously analyze millions of SNPs if computer memory allows, thereby potentially enabling a true multilocus analysis of high-dimensional genomic data. With application to data from the Alzheimer’s Disease Neuroimaging Initiative, we confirmed that our approach can identify well-known SNPs associated with AD and was much faster than recently published approaches (>=6000 times). AVAILABILITY AND IMPLEMENTATION: The source code and the testing datasets are available at https://github.com/Myuan2019/EBE_APML0. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/Myuan2019/EBE_APML0’]
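
The coupling of mixed-effects Empirical Bayesian Estimates with variant-level modeling can be sketched in two stages: fit a random-slope model to the longitudinal trait, extract each subject's estimated slope, and regress those slopes on genotype. The sketch uses statsmodels' MixedLM and a plain per-SNP regression as a stand-in for the paper's l0-norm multilocus step; the simulated data and variable names are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n_sub, n_visits = 100, 4
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_sub), n_visits),
    "time": np.tile(np.arange(n_visits), n_sub),
})
true_slope = rng.normal(0, 0.5, n_sub)
df["trait"] = true_slope[df.subject] * df.time + rng.normal(0, 0.3, len(df))

# Stage 1: random-intercept/random-slope model; extract EB estimates of the slopes
m = smf.mixedlm("trait ~ time", df, groups=df["subject"], re_formula="~time").fit()
eb_slope = np.array([m.random_effects[s]["time"] for s in range(n_sub)])

# Stage 2: per-SNP regression of EB slopes on genotype (stand-in for the l0-norm step)
snp = rng.integers(0, 3, n_sub).astype(float)
stage2 = sm.OLS(eb_slope, sm.add_constant(snp)).fit()
print("SNP p-value:", stage2.pvalues[1])
```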

From phenotype to genotype: an association study of longitudinal phenotypic markers to Alzheimer’s disease relevant SNPs.

MOTIVATION: Imaging genetic studies typically focus on identifying single-nucleotide polymorphism (SNP) markers associated with imaging phenotypes. Few studies perform regression of SNP values on phenotypic measures to examine how the SNP values change as phenotypic measures vary. This alternative approach may have the potential to help us discover important imaging genetic associations from a different perspective. In addition, the imaging markers are often measured over time, and this longitudinal profile may provide increased power for differentiating genotype groups. How to identify longitudinal phenotypic markers associated with disease-sensitive SNPs is an important and challenging research topic. RESULTS: Taking into account the temporal structure of the longitudinal imaging data and the interrelatedness among the SNPs, we propose a novel ‘task-correlated longitudinal sparse regression’ model to study the association between phenotypic imaging markers and the genotypes encoded by SNPs. In our new association model, we extend the widely used l(2,1)-norm for matrices to tensors to jointly select imaging markers that have common effects across all regression tasks and time points, and meanwhile impose trace-norm regularization on the unfolded coefficient tensor to achieve low rank, so that the interrelationship among SNPs is addressed. The effectiveness of our method is demonstrated by both clearly improved prediction performance in empirical evaluations and a compact set of selected imaging predictors relevant to disease-sensitive SNPs. AVAILABILITY: Software is publicly available at: http://ranger.uta.edu/%7eheng/Longitudinal/ CONTACT: heng@uta.edu or shenli@iupui.edu.

URL: [‘http://ranger.uta.edu/%7eheng/Longitudinal/’]

Tissue-specific network-based genome wide study of amygdala imaging phenotypes to identify functional interaction modules.

MOTIVATION: Network-based genome-wide association studies (GWAS) aim to identify functional modules from biological networks that are enriched by top GWAS findings. Although gene functions are relevant to tissue context, most existing methods analyze tissue-free networks without reflecting phenotypic specificity. RESULTS: We propose a novel module identification framework for imaging genetic studies using the tissue-specific functional interaction network. Our method includes three steps: (i) re-prioritize imaging GWAS findings by applying machine learning methods to incorporate network topological information and enhance the connectivity among top genes; (ii) detect densely connected modules based on interactions among top re-prioritized genes; and (iii) identify phenotype-relevant modules enriched by top GWAS findings. We demonstrate our method on the GWAS of [18F]FDG-PET measures in the amygdala region using the imaging genetic data from the Alzheimer’s Disease Neuroimaging Initiative, and map the GWAS results onto the amygdala-specific functional interaction network. The proposed network-based GWAS method can effectively detect densely connected modules enriched by top GWAS findings. Tissue-specific functional network can provide precise context to help explore the collective effects of genes with biologically meaningful interactions specific to the studied phenotype. AVAILABILITY AND IMPLEMENTATION: The R code and sample data are freely available at http://www.iu.edu/shenlab/tools/gwasmodule/. CONTACT: shenli@iu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘http://www.iu.edu/shenlab/tools/gwasmodule/’]

Identification of associations between genotypes and longitudinal phenotypes via temporally-constrained group sparse canonical correlation analysis.

MOTIVATION: Neuroimaging genetics identifies relationships between genetic variants (i.e., single nucleotide polymorphisms) and brain imaging data to reveal associations from genotypes to phenotypes. So far, most machine-learning approaches detect associations between genetic variants and brain imaging data at a single time point. However, such associations are based on static phenotypes and ignore the temporal dynamics of phenotypic changes. Phenotypes across multiple time points may exhibit temporal patterns that can be used to facilitate understanding of the degenerative process. In this article, we propose a novel temporally constrained group sparse canonical correlation analysis (TGSCCA) framework to identify genetic associations with longitudinal phenotypic markers. RESULTS: The proposed TGSCCA method captures temporal changes in the brain from longitudinal phenotypes by incorporating a fused penalty, which requires that the difference between canonical weight vectors at adjacent time points be small. A new efficient optimization algorithm is designed to solve the objective function. Furthermore, we demonstrate the effectiveness of our algorithm on both synthetic and real data (i.e., the Alzheimer’s Disease Neuroimaging Initiative cohort, including progressive mild cognitive impairment (MCI), stable MCI and normal control participants). In comparison with conventional SCCA, our proposed method achieves stronger associations and discovers phenotypic biomarkers across multiple time points that help guide interpretation of disease progression. AVAILABILITY AND IMPLEMENTATION: The Matlab code is available at https://sourceforge.net/projects/ibrain-cn/files/. CONTACT: dqzhang@nuaa.edu.cn or shenli@iu.edu.

URL: [‘https://sourceforge.net/projects/ibrain-cn/files/’]

Identifying quantitative trait loci via group-sparse multitask regression and feature selection: an imaging genetics study of the ADNI cohort.

MOTIVATION: Recent advances in high-throughput genotyping and brain imaging techniques enable new approaches to study the influence of genetic variation on brain structures and functions. Traditional association studies typically employ independent and pairwise univariate analysis, which treats single nucleotide polymorphisms (SNPs) and quantitative traits (QTs) as isolated units and ignores important underlying interacting relationships between the units. New methods are proposed here to overcome this limitation. RESULTS: Taking into account the interlinked structure within and between SNPs and imaging QTs, we propose a novel Group-Sparse Multi-task Regression and Feature Selection (G-SMuRFS) method to identify quantitative trait loci for multiple disease-relevant QTs and apply it to a study in mild cognitive impairment and Alzheimer’s disease. Built upon regression analysis, our model uses a new form of regularization, group l(2,1)-norm (G(2,1)-norm), to incorporate the biological group structures among SNPs induced from their genetic arrangement. The new G(2,1)-norm considers the regression coefficients of all the SNPs in each group with respect to all the QTs together and enforces sparsity at the group level. In addition, an l(2,1)-norm regularization is utilized to couple feature selection across multiple tasks to make use of the shared underlying mechanism among different brain regions. The effectiveness of the proposed method is demonstrated by both clearly improved prediction performance in empirical evaluations and a compact set of selected SNP predictors relevant to the imaging QTs. AVAILABILITY: Software is publicly available at: http://ranger.uta.edu/%7eheng/imaging-genetics/.

URL: [‘http://ranger.uta.edu/%7eheng/imaging-genetics/’]

DataRemix: a universal data transformation for optimal inference from gene expression datasets.

MOTIVATION: RNA-seq technology provides unprecedented power in the assessment of transcript abundance and can be used for a variety of downstream tasks such as inference of gene-correlation networks and eQTL discovery. However, raw gene expression values have to be normalized for nuisance biological variation and technical covariates, and different normalization strategies can lead to dramatically different results in downstream analyses. RESULTS: We describe a generalization of singular value decomposition-based reconstruction for which the common techniques of whitening, rank-k approximation and removal of the top k principal components are special cases. Our simple three-parameter transformation, DataRemix, can be tuned to reweigh the contribution of hidden factors and reveal otherwise hidden biological signals. In particular, we demonstrate that the method can effectively prioritize biological signals over noise without leveraging external dataset-specific knowledge, and can outperform normalization methods that make explicit use of known technical factors. We also show that DataRemix can be efficiently optimized via a Thompson sampling approach, which makes it feasible for computationally expensive objectives such as eQTL analysis. Finally, we apply our method to the Religious Orders Study and Memory and Aging Project dataset, and we report what is, to our knowledge, the first replicable trans-eQTL effect in human brain. AVAILABILITY AND IMPLEMENTATION: DataRemix is an R package freely available at GitHub (https://github.com/wgmao/DataRemix). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/wgmao/DataRemix’]
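
The reconstruction idea can be sketched as rescaling singular values, treating the leading k components and the remainder separately, so that operations such as the rank-k approximation or the down-weighting of dominant hidden factors become parameter settings. The parameterization below is an assumption chosen for illustration; the exact DataRemix transform and its Thompson-sampling tuning are described in the paper and R package.

```python
import numpy as np

def remix(X, k, p, mu):
    """SVD-based reweighting: raise the top-k singular values to the power p
    and scale the remaining ones by mu (an illustrative parameterization)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    d_new = d.copy()
    d_new[:k] = d[:k] ** p
    d_new[k:] = mu * d[k:]
    return (U * d_new) @ Vt

X = np.random.default_rng(9).standard_normal((50, 200))
X_rank_k = remix(X, k=5, p=1.0, mu=0.0)          # plain rank-5 approximation
X_damped = remix(X, k=5, p=0.5, mu=1.0)          # dampen dominant hidden factors
print(X_rank_k.shape, np.linalg.matrix_rank(X_rank_k))
```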

DLRS: gene tree evolution in light of a species tree.

SUMMARY: PrIME-DLRS (or colloquially: ‘Delirious’) is a phylogenetic software tool for simultaneously inferring and reconciling a gene tree given a species tree. It accounts for duplication and loss events and a relaxed molecular clock, and is intended for the study of homologous gene families, for example in a comparative genomics setting involving multiple species. PrIME-DLRS uses a Bayesian MCMC framework, where the input is a known species tree with divergence times and a multiple sequence alignment, and the output is a posterior distribution over gene trees and model parameters. AVAILABILITY AND IMPLEMENTATION: PrIME-DLRS is available for Java SE 6+ under the New BSD License, and JAR files and source code can be downloaded from http://code.google.com/p/jprime/. There is also a slightly older C++ version available as a binary package for Ubuntu, with download instructions at http://prime.sbc.su.se. The C++ source code is available upon request. CONTACT: joel.sjostrand@scilifelab.se or jens.lagergren@scilifelab.se. SUPPLEMENTARY INFORMATION: PrIME-DLRS is based on a sound probabilistic model (Akerborg et al., 2009) and has been thoroughly validated on synthetic and biological datasets (Supplementary Material online).

URL: [‘http://code.google.com/p/jprime/’, ‘http://prime.sbc.su.se’]

PRIMe: a method for characterization and evaluation of pleiotropic regions from multiple genome-wide association studies.

MOTIVATION: The concept of pleiotropy was proposed a century ago, though up to now there have been insufficient efforts to design robust statistics and software aimed at visualizing and evaluating pleiotropy at a regional level. The Pleiotropic Region Identification Method (PRIMe) was developed to evaluate potentially pleiotropic loci based upon data from multiple genome-wide association studies (GWAS). METHODS: We first provide a software tool to systematically identify and characterize genomic regions where low association P-values are observed with multiple traits. We use the term Pleiotropy Index to denote the number of traits with low association P-values at a particular genomic region. For GWAS assumed to be uncorrelated, we adopted the binomial distribution to approximate the statistical significance of the Pleiotropy Index. For GWAS conducted on traits with known correlation coefficients, simulations are performed to derive the statistical distribution of the Pleiotropy Index under the null hypothesis of no genotype-phenotype association. For six hematologic and three blood pressure traits where full GWAS results were available from the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium, we estimated the trait correlations and applied the simulation approach to examine genomic regions with statistical evidence of pleiotropy. We then applied the approximation approach to explore GWAS summarized in the National Human Genome Research Institute (NHGRI) GWAS Catalog. RESULTS: By simulation, we identified pleiotropic regions including SH2B3 and BRAP (12q24.12) for hematologic and blood pressure traits. By approximation, we confirmed the genome-wide significant pleiotropy of these two regions based on the GWAS Catalog data, together with an exploration on other regions which highlights the FTO, GCKR and ABO regions. AVAILABILITY AND IMPLEMENTATION: The Perl and R scripts are available at http://www.framinghamheartstudy.org/research/gwas_pleiotropictool.html.

URL: [‘http://www.framinghamheartstudy.org/research/gwas_pleiotropictool.html’]
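
For uncorrelated GWAS, the significance of a Pleiotropy Index of k traits out of m at a region can be approximated with a binomial tail probability, where the per-trait success probability is the chance that a trait shows a low association P-value in that region by chance. The numbers below are placeholders for illustration; correlated traits require the simulation approach described in the abstract.

```python
from scipy.stats import binom

m_traits = 9           # number of (assumed independent) GWAS
p_region = 0.01        # assumed chance a trait shows a low P-value in this region
k_observed = 4         # observed Pleiotropy Index: traits with low P-values here

# P(Pleiotropy Index >= k) under the null of no genotype-phenotype association
p_value = binom.sf(k_observed - 1, m_traits, p_region)
print(f"binomial approximation: P = {p_value:.2e}")
```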

CoMM: a collaborative mixed model to dissect genetic contributions to complex traits by leveraging regulatory information.

MOTIVATION: Genome-wide association studies (GWASs) have been successful in identifying many genetic variants associated with complex traits. However, the mechanistic links between these variants and complex traits remain elusive. A scientific hypothesis is that genetic variants influence complex traits at the organismal level by affecting cellular traits, such as regulating gene expression and altering protein abundance. Although earlier works have presented promising scientific insights supporting this hypothesis, statistical methods that effectively harness multilayered data (e.g. genetic variants, cellular traits and organismal traits) on a large scale for functional and mechanistic exploration are still in high demand. RESULTS: In this study, we propose a collaborative mixed model (CoMM) to investigate the mechanistic role of associated variants in complex traits. The key idea builds on the emerging scientific evidence that genetic effects at the cellular level are much stronger than those at the organismal level. Briefly, CoMM combines two models: the first relating gene expression to genotype and the second relating phenotype to the gene expression predicted by the first model. The two models are fitted jointly in CoMM, so that the uncertainty in predicting gene expression is fully accounted for. To demonstrate the advantages of CoMM over existing methods, we conducted extensive simulation studies and also applied CoMM to analyze 25 traits in the NFBC1966 and Genetic Epidemiology Research on Aging (GERA) studies, integrating transcriptome information from the Genetic European Variation in Health and Disease (GEUVADIS) Project. The results indicate that, by leveraging regulatory information, CoMM can effectively improve the power of prioritizing risk variants. Regarding computational efficiency, CoMM can complete the analysis of the NFBC1966 and GERA datasets in 2 and 18 min, respectively. AVAILABILITY AND IMPLEMENTATION: The developed R package is available at https://github.com/gordonliu810822/CoMM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/gordonliu810822/CoMM’]

CellTICS: an explainable neural network for cell-type identification and interpretation based on single-cell RNA-seq data.

Identifying cell types is crucial for understanding the functional units of an organism. Machine learning has shown promising performance in identifying cell types, but many existing methods lack biological significance due to poor interpretability. Yet it is of the utmost importance to understand what makes cells share the same function and form a specific cell type, motivating us to propose a biologically interpretable method. CellTICS prioritizes marker genes with cell-type-specific expression, uses a hierarchy of biological pathways for neural network construction, and applies a multi-predictive-layer strategy to predict cell and sub-cell types. CellTICS usually outperforms existing methods in prediction accuracy. Moreover, CellTICS can reveal pathways that define a cell type or a cell type under specific physiological conditions, such as disease or aging. The nonlinear nature of neural networks enables us to identify many novel pathways. Interestingly, some of the pathways identified by CellTICS exhibit differential expression “variability” rather than differential expression across cell types, indicating that expression stochasticity within a pathway could be an important characteristic of a cell type. Overall, CellTICS provides a biologically interpretable method for identifying and characterizing cell types, shedding light on the underlying pathways that define cellular heterogeneity and its role in organismal function. CellTICS is available at https://github.com/qyyin0516/CellTICS.

URL: [‘https://github.com/qyyin0516/CellTICS’]

Diagnostic Evidence GAuge of Single cells (DEGAS): a flexible deep transfer learning framework for prioritizing cells in relation to disease.

We propose DEGAS (Diagnostic Evidence GAuge of Single cells), a novel deep transfer learning framework, to transfer disease information from patients to cells. We call such transferrable information “impressions,” which allow individual cells to be associated with disease attributes like diagnosis, prognosis, and response to therapy. Using simulated data and ten diverse single-cell and patient bulk tissue transcriptomic datasets from glioblastoma multiforme (GBM), Alzheimer’s disease (AD), and multiple myeloma (MM), we demonstrate the feasibility, flexibility, and broad applications of the DEGAS framework. DEGAS analysis on myeloma single-cell transcriptomics identified PHF19high myeloma cells associated with progression. Availability: https://github.com/tsteelejohnson91/DEGAS .

URL: [‘https://github.com/tsteelejohnson91/DEGAS’]

scGRNom: a computational pipeline of integrative multi-omics analyses for predicting cell-type disease genes and regulatory networks.

Understanding cell-type-specific gene regulatory mechanisms from genetic variants to diseases remains challenging. To address this, we developed a computational pipeline, scGRNom (single-cell Gene Regulatory Network prediction from multi-omics), to predict cell-type disease genes and regulatory networks including transcription factors and regulatory elements. With applications to schizophrenia and Alzheimer’s disease, we predicted disease genes and regulatory networks for excitatory and inhibitory neurons, microglia, and oligodendrocytes. Further enrichment analyses revealed cross-disease and disease-specific functions and pathways at the cell-type level. Our machine learning analysis also found that cell-type disease genes improved clinical phenotype predictions. scGRNom is a general-purpose tool available at https://github.com/daifengwanglab/scGRNom .

URL: [‘https://github.com/daifengwanglab/scGRNom’]

CosGeneGate selects multi-functional and credible biomarkers for single-cell analysis.

MOTIVATION: Selecting representative genes, or marker genes, to distinguish cell types is an important task in single-cell sequencing analysis. Although many methods have been proposed to select marker genes, the selected genes may be redundant and/or may not show cell-type-specific expression patterns that distinguish cell types. RESULTS: Here, we present a novel model, named CosGeneGate, for more effective marker gene selection. CosGeneGate is designed to combine the advantages of selecting marker genes based on both cell-type classification accuracy and marker-gene-specific expression patterns. We demonstrate that the marker genes selected by CosGeneGate yield better performance in various downstream analyses than those selected by existing methods, on both public datasets and newly sequenced datasets. The non-redundant marker genes identified by CosGeneGate for major human cell types and tissues can be found at: https://github.com/VivLon/CosGeneGate/blob/main/marker gene list.xlsx.

URL: [‘https://github.com/VivLon/CosGeneGate/blob/main/marker’]

Deep5hmC: Predicting genome-wide 5-Hydroxymethylcytosine landscape via a multimodal deep learning model.

MOTIVATION: 5-hydroxymethylcytosine (5hmC), a crucial epigenetic mark with a significant role in regulating tissue-specific gene expression, is essential for understanding the dynamic functions of the human genome. Despite its importance, predicting 5hmC modification across the genome remains challenging, especially when considering the complex interplay between DNA sequence and various epigenetic factors such as histone modifications and chromatin accessibility. RESULTS: Using tissue-specific 5hmC sequencing data, we introduce Deep5hmC, a multimodal deep learning framework that integrates both the DNA sequence and epigenetic features such as histone modification and chromatin accessibility to predict genome-wide 5hmC modification. The multimodal design of Deep5hmC yields remarkable improvement in predicting both qualitative and quantitative 5hmC modification compared to unimodal versions of Deep5hmC and state-of-the-art machine learning methods. This improvement is demonstrated through benchmarking on a comprehensive set of 5hmC sequencing data collected at four developmental stages during forebrain organoid development and across 17 human tissues. Compared to DeepSEA and random forest, Deep5hmC achieves close to 4% and 17% improvement in AUROC across the four forebrain developmental stages, and 6% and 27% across 17 human tissues, for predicting binary 5hmC modification sites; and 8% and 22% improvement in Spearman correlation coefficient across the four forebrain developmental stages, and 17% and 30% across 17 human tissues, for predicting continuous 5hmC modification. Notably, Deep5hmC showcases its practical utility by accurately predicting gene expression and identifying differentially hydroxymethylated regions in a case-control study of Alzheimer’s disease. Deep5hmC significantly improves our understanding of tissue-specific gene regulation and facilitates the development of new biomarkers for complex diseases. AVAILABILITY AND IMPLEMENTATION: Deep5hmC is available via https://github.com/lichen-lab/Deep5hmC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/lichen-lab/Deep5hmC’]

DiSTect: a Bayesian model for disease-associated gene discovery and prediction in spatial transcriptomics.

MOTIVATION: Identifying disease-indicative genes is critical for deciphering disease mechanisms and has attracted significant interest in biomedical research. Spatial transcriptomics offers unprecedented insights for the detection of disease-associated genes by enabling within-tissue contrasts. However, this new technology poses challenges for conventional statistical models developed for RNA-sequencing, as these models often neglect the spatial organization of tissue spots. RESULTS: In this article, we propose DiSTect, a Bayesian shrinkage model that characterizes the relationship between high-dimensional gene expression and the disease status of each tissue spot, incorporating spatial correlation among spots through autoregressive terms. Our model adopts a hierarchical structure to facilitate the analysis of multiple correlated samples and is further extended to accommodate missing data within tissues. To ensure the model’s applicability to datasets of varying sizes, we develop two computational frameworks for Bayesian parameter estimation, tailored to small- and large-sample scenarios. Simulation studies are conducted to evaluate the performance of the proposed model. The model is applied to data arising from studies of HER2+ breast cancer and Alzheimer’s disease. AVAILABILITY AND IMPLEMENTATION: The dataset and source code are available on GitHub (https://github.com/StaGill/DiSTect) and Zenodo (https://zenodo.org/records/17127211). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/StaGill/DiSTect’, ‘https://zenodo.org/records/17127211’]

Differential abundance testing on single-cell data using k-nearest neighbor graphs.

Current computational workflows for comparative analyses of single-cell datasets typically use discrete clusters as input when testing for differential abundance among experimental conditions. However, clusters do not always provide the appropriate resolution and cannot capture continuous trajectories. Here we present Milo, a scalable statistical framework that performs differential abundance testing by assigning cells to partially overlapping neighborhoods on a k-nearest neighbor graph. Using simulations and single-cell RNA sequencing (scRNA-seq) data, we show that Milo can identify perturbations that are obscured by discretizing cells into clusters, that it maintains false discovery rate control across batch effects and that it outperforms alternative differential abundance testing strategies. Milo identifies the decline of a fate-biased epithelial precursor in the aging mouse thymus and identifies perturbations to multiple lineages in human cirrhotic liver. As Milo is based on a cell-cell similarity structure, it might also be applicable to single-cell data other than scRNA-seq. Milo is provided as an open-source R software package at https://github.com/MarioniLab/miloR .

URL: [‘https://github.com/MarioniLab/miloR’]
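
The neighborhood idea can be sketched simply: embed cells, build a k-nearest-neighbor graph, treat each sampled index cell's neighbors as a partially overlapping neighborhood, and test the per-condition cell counts in each neighborhood. The per-neighborhood binomial test below is only a stand-in for illustration; Milo itself fits a negative binomial GLM and controls a weighted spatial FDR.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.stats import binomtest

rng = np.random.default_rng(10)
n_cells, k = 1000, 30
pca = rng.standard_normal((n_cells, 10))            # e.g. PCA of scRNA-seq profiles
condition = rng.integers(0, 2, n_cells)             # 0 = control, 1 = treated

nn = NearestNeighbors(n_neighbors=k).fit(pca)
_, idx = nn.kneighbors(pca)

# Sample index cells; each neighborhood = the index cell's k nearest neighbors
index_cells = rng.choice(n_cells, size=100, replace=False)
expected = condition.mean()
for c in index_cells[:5]:
    n_treated = int(condition[idx[c]].sum())
    res = binomtest(n_treated, k, expected)
    print(f"neighborhood {c}: {n_treated}/{k} treated, p = {res.pvalue:.3f}")
```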

Self-supervised semantic segmentation of retinal pigment epithelium cells in flatmount fluorescent microscopy images.

MOTIVATION: Morphological analyses with flatmount fluorescent images are essential to retinal pigment epithelial (RPE) aging studies and thus require accurate RPE cell segmentation. Although rapid advances in deep learning semantic segmentation have achieved great success in many areas of biomedical research, the performance of these supervised learning methods for RPE cell segmentation is still limited by the lack of training data with high-quality annotations. RESULTS: To address this problem, we develop a Self-Supervised Semantic Segmentation (S4) method that uses a self-supervised learning strategy to train a semantic segmentation network with an encoder-decoder architecture. We employ a reconstruction loss and a pairwise representation loss to make the encoder extract structural information, and we introduce a morphology loss to produce the segmentation map. In addition, we develop a novel image augmentation algorithm (AugCut) to produce multiple views for self-supervised learning and enhance network training. To validate its efficacy, we applied the S4 method to a large set of flatmount fluorescent microscopy images and compared it with other state-of-the-art deep learning approaches. Our method demonstrates better performance in both qualitative and quantitative evaluations, suggesting promising potential to support large-scale cell morphological analyses in RPE aging investigations. AVAILABILITY AND IMPLEMENTATION: The code and documentation are available at: https://github.com/jkonglab/S4_RPE.

URL: [‘https://github.com/jkonglab/S4_RPE’]

DiSC: a Statistical Tool for Fast Differential Expression Analysis of Individual-level Single-cell RNA-seq Data.

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) has become an important method for characterizing cellular heterogeneity, revealing more biological insights than bulk RNA-seq. The surge in scRNA-seq data across multiple individuals calls for efficient and statistically powerful methods for differential expression (DE) analysis that address individual-level biological variability. RESULTS: We introduce DiSC, a method for conducting individual-level DE analysis by extracting multiple distributional characteristics, jointly testing their association with a variable of interest, and using a flexible permutation testing framework to control the false discovery rate (FDR). Our simulation studies demonstrated that DiSC effectively controlled the FDR across various settings and exhibited high statistical power in detecting different types of gene expression changes. Moreover, DiSC is computationally efficient and scalable to the rapidly increasing sample sizes in scRNA-seq studies. When applying DiSC to identify DE genes potentially associated with COVID-19 severity and Alzheimer’s disease across various types of peripheral blood mononuclear cells and neural cells, we found that our method was approximately 100 times faster than other state-of-the-art methods and that the results were consistent with and supported by the existing literature. While DiSC was developed for scRNA-seq data, its robust testing framework can also be applied to other types of single-cell data: when applied to cytometry by time-of-flight data, DiSC identified significantly more DE markers than traditional methods. AVAILABILITY: The R software package “SingleCellStat” is freely available on CRAN (https://cran.r-project.org/web/packages/SingleCellStat/index.html) and GitHub (https://github.com/Lujun995/DiSC). The replication code for reproducing the analyses in this study is publicly accessible at https://github.com/Lujun995/DiSC_Replication_Code. SUPPLEMENTARY INFORMATION: The scRNA-seq expression matrix and metadata utilized in our simulations and analyses can be retrieved from https://cells.ucsc.edu/autism/rawMatrix.zip, https://cellxgene.cziscience.com/collections/1ca90a2d-2943-483d-b678-b809bf464c30, and https://covid19.cog.sanger.ac.uk/submissions/release1/haniffa21.processed.h5ad. Supplementary data are available at Bioinformatics online.

URL: [‘https://cran.r-project.org/web/packages/SingleCellStat/index.html’, ‘https://github.com/Lujun995/DiSC’, ‘https://github.com/Lujun995/DiSC_Replication_Code’, ‘https://cells.ucsc.edu/autism/rawMatrix.zip’, ‘https://cellxgene.cziscience.com/collections/1ca90a2d-2943-483d-b678-b809bf464c30’, ‘https://covid19.cog.sanger.ac.uk/submissions/release1/haniffa21.processed.h5ad’]
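
DiSC summarizes each gene's expression within every individual by several distributional characteristics and then tests their joint association with a covariate under permutation. The sketch below illustrates that general idea with hypothetical summary features (mean, variance, detection rate) and a simple permutation p-value; it is not the published DiSC implementation, which is distributed as the R package SingleCellStat.

```python
# Illustrative sketch (not the SingleCellStat/DiSC code): summarize one gene per
# individual by a few distributional features, then permutation-test a group effect.
import numpy as np

def individual_features(counts_by_individual):
    """counts_by_individual: list of 1-D arrays, one array of UMI counts per individual."""
    feats = []
    for x in counts_by_individual:
        feats.append([x.mean(), x.var(), np.mean(x > 0)])   # mean, variance, detection rate
    return np.asarray(feats)

def permutation_pvalue(features, groups, n_perm=10_000, seed=0):
    """Joint test statistic: sum of squared standardized group-mean differences."""
    rng = np.random.default_rng(seed)
    z = (features - features.mean(0)) / (features.std(0) + 1e-12)
    def stat(g):
        return np.sum((z[g == 1].mean(0) - z[g == 0].mean(0)) ** 2)
    observed = stat(groups)
    perms = np.array([stat(rng.permutation(groups)) for _ in range(n_perm)])
    return (1 + np.sum(perms >= observed)) / (1 + n_perm)

rng = np.random.default_rng(1)
counts = [rng.poisson(2, 200) for _ in range(20)]            # 20 individuals, 200 cells each
groups = np.array([0] * 10 + [1] * 10)
print(permutation_pvalue(individual_features(counts), groups, n_perm=1000))
```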

Recovering time-varying networks from single-cell data.

MOTIVATION: Gene regulation is a dynamic process that underlies all aspects of human development, disease response, and other biological processes. The reconstruction of temporal gene regulatory networks has conventionally relied on regression analysis, graphical models, or other types of relevance networks. With the large increase in time series single-cell data, new approaches are needed to address the unique scale and nature of these data for reconstructing such networks. RESULTS: Here, we develop a deep neural network, Marlene, to infer dynamic graphs from time series single-cell gene expression data. Marlene constructs directed gene networks using a self-attention mechanism where the weights evolve over time using recurrent units. By employing meta learning, the model is able to recover accurate temporal networks even for rare cell types. In addition, Marlene can identify gene interactions relevant to specific biological responses, including COVID-19 immune response, fibrosis, and aging, paving the way for potential treatments. AVAILABILITY AND IMPLEMENTATION: The code used to train Marlene is available at https://github.com/euxhenh/Marlene.

URL: [‘https://github.com/euxhenh/Marlene’]

Learning meaningful representation of single-neuron morphology via large-scale pre-training.

SUMMARY: Single-neuron morphology, the study of the structure, form, and shape of a group of specialized cells in the nervous system, is of vital importance for defining neuron types, assessing changes in neuronal development and aging, and determining the effects of brain disorders and treatments. Despite the recent surge in the amount of available neuron morphology reconstructions due to advancements in microscopy imaging, existing computational and deep learning methods for modeling neuron morphology have been limited in both scale and accuracy. In this paper, we propose MorphRep, a model for learning meaningful representations of neuron morphology pre-trained on over 250 000 existing neuron morphology reconstructions. By encoding the neuron morphology into graph-structured data, using graph transformers for feature encoding and enforcing the consistency between multiple augmented views of neuron morphology, MorphRep achieves state-of-the-art performance on widely used benchmarking datasets. Meanwhile, MorphRep can accurately characterize the neuron morphology space across neuron morphometrics, fine-grained cell types, brain regions and ages. Furthermore, MorphRep can be applied to distinguish neurons under a wide range of conditions, including genetic perturbation, drug injection, environment change and disease. In summary, MorphRep provides an effective strategy to embed and represent neuron morphology and can be a valuable tool in integrating cell morphology into single-cell multiomics analysis. AVAILABILITY AND IMPLEMENTATION: The codebase has been deposited at https://github.com/YaxuanLi-cn/MorphRep.

URL: [‘https://github.com/YaxuanLi-cn/MorphRep’]

Optimal Gene Filtering for Single-Cell data (OGFSC)-a gene filtering algorithm for single-cell RNA-seq data.

MOTIVATION: Single-cell transcriptomic data are commonly accompanied by extremely high technical noise due to the low RNA concentrations from individual cells. Precise identification of differentially expressed genes and cell populations is heavily dependent on the effective reduction of technical noise, e.g. by gene filtering. However, there is still no well-established standard among current approaches to gene filtering. Investigators usually filter out genes based on a single fixed threshold, which commonly leads to both over- and under-stringent errors. RESULTS: In this study, we propose a novel algorithm, termed Optimal Gene Filtering for Single-Cell data, to construct a thresholding curve based on gene expression levels and the corresponding variances. We validated our method on multiple single-cell RNA-seq datasets, including simulated and published experimental datasets. The results show that the known signal and known noise are reliably discriminated in the simulated datasets. In addition, the results on seven experimental datasets demonstrate that cells of the same annotated type are more sharply clustered using our method. Interestingly, when we re-analyze a dataset from an aging study recently published in Science, we find a list of regulated genes that differs from the one reported in the original study, owing to the different filtering methods used. However, our findings better match the progression of immunosenescence. In summary, we provide an alternative way to probe the true level of technical noise in single-cell transcriptomic data. AVAILABILITY AND IMPLEMENTATION: https://github.com/XZouProjects/OGFSC.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/XZouProjects/OGFSC.git’]
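
OGFSC builds a thresholding curve over gene expression levels and their variances rather than applying a single fixed cutoff. A rough illustration of that mean-variance idea is sketched below using a polynomial fit of log CV² against log mean expression; the actual OGFSC algorithm in the repository above differs in detail.

```python
# Rough mean-variance gene filter in the spirit of OGFSC (illustrative only;
# the published algorithm in the repository differs in detail).
import numpy as np

def filter_genes(expr, degree=2, margin=0.0):
    """expr: genes x cells matrix. Keep genes whose log CV^2 exceeds a fitted curve."""
    mean = expr.mean(axis=1)
    var = expr.var(axis=1)
    keep0 = mean > 0
    log_mean = np.log10(mean[keep0])
    log_cv2 = np.log10(var[keep0] / (mean[keep0] ** 2) + 1e-12)
    coefs = np.polyfit(log_mean, log_cv2, degree)       # thresholding curve
    threshold = np.polyval(coefs, log_mean) + margin
    selected = np.zeros(expr.shape[0], dtype=bool)
    selected[np.where(keep0)[0]] = log_cv2 > threshold
    return selected

rng = np.random.default_rng(0)
expr = rng.poisson(1.0, size=(500, 100)).astype(float)  # toy genes x cells matrix
print(filter_genes(expr).sum(), "genes kept")
```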

A single-cell and spatial RNA-seq database for Alzheimer’s disease (ssREAD).

Alzheimer’s Disease (AD) pathology has been increasingly explored through single-cell and single-nucleus RNA-sequencing (scRNA-seq & snRNA-seq) and spatial transcriptomics (ST). However, the surge in data demands a comprehensive, user-friendly repository. Addressing this, we introduce a single-cell and spatial RNA-seq database for Alzheimer’s disease (ssREAD). It offers a broader spectrum of AD-related datasets, an optimized analytical pipeline, and improved usability. The database encompasses 1,053 samples (277 integrated datasets) from 67 AD-related scRNA-seq & snRNA-seq studies, totaling 7,332,202 cells. Additionally, it archives 381 ST datasets from 18 human and mouse brain studies. Each dataset is annotated with details such as species, gender, brain region, disease/control status, age, and AD Braak stages. ssREAD also provides an analysis suite for cell clustering, identification of differentially expressed and spatially variable genes, cell-type-specific marker genes and regulons, and spot deconvolution for integrative analysis. ssREAD is freely available at https://bmblx.bmi.osumc.edu/ssread/ .

URL: [‘https://bmblx.bmi.osumc.edu/ssread/’]

ZEBRA: a hierarchically integrated gene expression atlas of the murine and human brain at single-cell resolution.

The molecular causes and mechanisms of neurodegenerative diseases remain poorly understood. A growing number of single-cell studies have implicated various neural, glial, and immune cell subtypes in age-related disorders of the mammalian central nervous system. Integrating this body of transcriptomic evidence into a comprehensive and reproducible framework poses several computational challenges. Here, we introduce ZEBRA, a large single-cell and single-nucleus RNA-seq database. ZEBRA integrates and normalizes gene expression and metadata from 33 studies, encompassing 4.2 million human and mouse brain cells sampled from 39 brain regions. It incorporates samples from patients with neurodegenerative diseases like Alzheimer’s disease, Parkinson’s disease, and multiple sclerosis, as well as samples from relevant mouse models. We employed scVI, a deep probabilistic auto-encoder model, to integrate the samples and curated both cell and sample metadata for downstream analysis. ZEBRA allows cell-type- and disease-specific markers to be explored and compared between sample conditions and brain regions, and supports cell composition analysis and gene-wise feature mappings. Our comprehensive molecular database facilitates the generation of data-driven hypotheses, enhancing our understanding of mammalian brain function during aging and disease. The data sets, along with an interactive database, are freely available at https://www.ccb.uni-saarland.de/zebra.

URL: [‘https://www.ccb.uni-saarland.de/zebra’]
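
ZEBRA relies on scVI to integrate samples across studies. A typical scvi-tools workflow for obtaining a batch-corrected latent space looks roughly like the following; the input file, batch key, and training settings here are placeholders, not the ZEBRA pipeline's actual configuration.

```python
# Typical scvi-tools integration workflow (placeholder settings and file names,
# not the ZEBRA pipeline's actual configuration).
import anndata as ad
import scvi

adata = ad.read_h5ad("brain_cells.h5ad")                      # hypothetical input file
scvi.model.SCVI.setup_anndata(adata, batch_key="study_id")    # register the batch covariate
model = scvi.model.SCVI(adata, n_latent=30)
model.train(max_epochs=100)
adata.obsm["X_scVI"] = model.get_latent_representation()      # integrated latent space
```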

SCAN: Spatiotemporal Cloud Atlas for Neural cells.

The nervous system is one of the most complicated and enigmatic systems within the animal kingdom. Recently, the emergence and development of spatial transcriptomics (ST) and single-cell RNA sequencing (scRNA-seq) technologies have provided an unprecedented ability to systematically decipher the cellular heterogeneity and spatial locations of the nervous system from multiple unbiased aspects. However, efficiently integrating, presenting and analyzing massive multiomic data remains a huge challenge. Here, we manually collected and comprehensively analyzed high-quality scRNA-seq and ST data from the nervous system, covering 10 679 684 cells. In addition, multi-omic datasets from more than 900 species were included for extensive data mining from an evolutionary perspective. Furthermore, over 100 neurological diseases (e.g. Alzheimer’s disease, Parkinson’s disease, Down syndrome) were systematically analyzed for high-throughput screening of putative biomarkers. Differential expression patterns across developmental time points, cell types and ST spots were discerned and subsequently subjected to extensive interpretation. To provide researchers with efficient data exploration, we created a new database with interactive interfaces and integrated functions called the Spatiotemporal Cloud Atlas for Neural cells (SCAN), freely accessible at http://47.98.139.124:8799 or http://scanatlas.net. SCAN will benefit the neuroscience research community to better exploit the spatiotemporal atlas of the neural system and promote the development of diagnostic strategies for various neurological disorders.

URL: [‘http://47.98.139.124:8799’, ‘http://scanatlas.net’]

A single-cell atlas to map sex-specific gene-expression changes in blood upon neurodegeneration.

The clinical course and treatment of neurodegenerative disease are complicated by immune-system interference and chronic inflammatory processes, which remain incompletely understood. Mapping immune signatures in larger human cohorts through single-cell gene expression profiling supports our understanding of observed peripheral changes in neurodegeneration. Here, we employ single-cell gene expression profiling of over 909k peripheral blood mononuclear cells (PBMCs) from 121 healthy individuals, 48 patients with mild cognitive impairment (MCI), 46 with Parkinson’s disease (PD), 27 with Alzheimer’s disease (AD), and 15 with both PD and MCI. The dataset is interactively accessible through a freely available website (https://www.ccb.uni-saarland.de/adrcsc). In this work, we identify disease-associated changes in blood cell type composition and gene expression in a sex-specific manner, offering insights into peripheral and solid tissue signatures in AD and PD.

URL: [‘https://www.ccb.uni-saarland.de/adrcsc’]

STAB2: an updated spatio-temporal cell atlas of the human and mouse brain.

The brain is composed of heterogeneous types of neuronal and non-neuronal cells, which are organized into distinct anatomical regions and show precise regulation of gene expression during development, aging and function. In the current database release, STAB2 provides a systematic cellular map of the human and mouse brain by integrating recently published large-scale single-cell and single-nucleus RNA-sequencing datasets from diverse regions and across the lifespan. We applied a hierarchical strategy of unsupervised clustering on the integrated single-cell transcriptomic datasets to precisely annotate the cell types and subtypes in the human and mouse brain. Currently, STAB2 includes 71 and 61 different cell subtypes defined in the human and mouse brain, respectively. It covers 63 subregions and 15 developmental stages of the human brain, and 38 subregions and 30 developmental stages of the mouse brain, generating a comprehensive atlas for exploring spatiotemporal transcriptomic dynamics in the mammalian brain. We also augmented the web interfaces for querying and visualizing gene expression in specific cell types. STAB2 is freely available at https://mai.fudan.edu.cn/stab2.

URL: [‘https://mai.fudan.edu.cn/stab2’]

AD-Syn-Net: systematic identification of Alzheimer’s disease-associated mutation and co-mutation vulnerabilities via deep learning.

Alzheimer’s disease (AD) is one of the most challenging neurodegenerative diseases because of its complicated and progressive mechanisms and multiple risk factors. Increasing research evidence demonstrates that genetics may be a key factor responsible for the occurrence of the disease. Although previous reports identified quite a few AD-associated genes, they were mostly limited owing to patient sample size and selection bias. There has been a lack of comprehensive research aimed at systematically identifying AD-associated risk mutations. To address this challenge, we hereby construct a large-scale AD mutation and co-mutation framework (‘AD-Syn-Net’), and propose deep learning models named Deep-SMCI and Deep-CMCI, configured with fully connected layers, that are capable of effectively predicting cognitive impairment of subjects based on genetic mutation and co-mutation profiles. Next, we apply the customized frameworks to these data sets to evaluate the importance scores of the mutations and identify mutation effectors and co-mutation combination vulnerabilities contributing to cognitive impairment. Furthermore, we evaluate the influence of mutation pairs on the network architecture to dissect the genetic organization of AD and identify novel co-mutations that could be responsible for dementia, laying a solid foundation for proposing future targeted therapy for AD precision medicine. Our deep learning model codes are available open access here: https://github.com/Pan-Bio/AD-mutation-effectors.

URL: [‘https://github.com/Pan-Bio/AD-mutation-effectors’]

VCPA: genomic variant calling pipeline and data management tool for Alzheimer’s Disease Sequencing Project.

SUMMARY: We report VCPA, our SNP/Indel Variant Calling Pipeline and data management tool used for the analysis of whole genome and exome sequencing (WGS/WES) for the Alzheimer’s Disease Sequencing Project. VCPA consists of two independent but linkable components: a pipeline and a tracking database. The pipeline, implemented using the Workflow Description Language and fully optimized for the Amazon Elastic Compute Cloud environment, includes steps from aligning raw sequence reads to variant calling using GATK. The tracking database allows users to view job running status in real time and visualize >100 quality metrics per genome. VCPA is functionally equivalent to the CCDG/TOPMed pipeline. Users can use the pipeline and the dockerized database to process large WGS/WES datasets on the Amazon cloud with minimal configuration. AVAILABILITY AND IMPLEMENTATION: VCPA is released under the MIT license and is freely available for academic and nonprofit use. The pipeline source code and step-by-step instructions are available from the National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (http://www.niagads.org/VCPA). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘http://www.niagads.org/VCPA’]

Functional annotation of genomic variants in studies of late-onset Alzheimer’s disease.

Motivation: Annotation of genomic variants is an increasingly important and complex part of sequence-based genomic analyses. Computational predictions of variant function are routinely incorporated into gene-based analyses of rare variants, though to date most studies use limited information for assessing variant function that is often agnostic of the disease being studied. Results: In this work, we outline an annotation process motivated by the Alzheimer’s Disease Sequencing Project, illustrate the impact of including tissue-specific transcript sets and sources of gene regulatory information, and assess the potential impact of changing genomic builds on the annotation process. While these factors only impact a small proportion of total variant annotations (~5%), they influence the potential analysis of a large fraction of genes (~25%). Availability and implementation: Individual variant annotations are available via the NIAGADS GenomicsDB, at https://www.niagads.org/genomics/tools-and-software/databases/genomics-database. Annotations are also available for bulk download at https://www.niagads.org/datasets. Annotation processing software is available at http://www.icompbio.net/resources/software-and-downloads/. Supplementary information: Supplementary data are available at Bioinformatics online.

URL: [‘https://www.niagads.org/genomics/tools-and-software/databases/genomics-database’, ‘https://www.niagads.org/datasets’, ‘http://www.icompbio.net/resources/software-and-downloads/’]

BEATRICE: Bayesian Fine-mapping from Summary Data using Deep Variational Inference.

MOTIVATION: We introduce a novel framework, BEATRICE, to identify putative causal variants from GWAS summary statistics. Identifying causal variants is challenging due to their sparsity and their high correlation with variants in nearby regions. To account for these challenges, we rely on a hierarchical Bayesian model that imposes a binary concrete prior on the set of causal variants. We derive a variational algorithm for this fine-mapping problem by minimizing the KL divergence between an approximate density and the posterior probability distribution of the causal configurations. Correspondingly, we use a deep neural network as an inference machine to estimate the parameters of our proposal distribution. Our stochastic optimization procedure allows us to sample from the space of causal configurations, which we use to compute the posterior inclusion probabilities and determine credible sets for each causal variant. We conduct a detailed simulation study to quantify the performance of our framework against two state-of-the-art baseline methods across different numbers of causal variants and noise paradigms, as defined by the relative genetic contributions of causal and non-causal variants. RESULTS: We demonstrate that BEATRICE achieves uniformly better coverage with comparable power and set sizes, and that the performance gain increases with the number of causal variants. We also show the efficacy of BEATRICE in finding causal variants from a GWAS of Alzheimer’s disease. In comparison to the baselines, only BEATRICE can successfully identify the APOE epsilon2 allele, a variant commonly associated with Alzheimer’s disease. AVAILABILITY: BEATRICE is available for download at https://github.com/sayangsep/Beatrice-Finemapping. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/sayangsep/Beatrice-Finemapping’]
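
BEATRICE places a binary concrete prior on the causal configuration and optimizes a variational approximation. The binary concrete (relaxed Bernoulli) reparameterization at the heart of such a scheme can be sketched as below; the inference network that produces the logits and the full ELBO are omitted, so this is only an illustration of the building block, not the authors' code.

```python
# Binary concrete (relaxed Bernoulli) sampling, the building block of the kind of
# variational fine-mapping scheme BEATRICE describes; the inference network and
# ELBO terms are omitted here.
import torch

def sample_binary_concrete(logits, temperature=0.5):
    """Differentiable relaxation of Bernoulli(sigmoid(logits))."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    gumbel = torch.log(u) - torch.log1p(-u)                  # logistic noise
    return torch.sigmoid((logits + gumbel) / temperature)

logits = torch.zeros(100, requires_grad=True)                # one logit per variant
relaxed_config = sample_binary_concrete(logits)              # soft causal configuration
pip = torch.sigmoid(logits)                                  # posterior inclusion probability estimate
print(relaxed_config.shape, pip[:5])
```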

Dementia key gene identification with multi-layered SNP-gene-disease network.

MOTIVATION: Recently, various approaches for diagnosing and treating dementia have received significant attention, especially those for identifying key genes that are crucial for dementia. If the mutations of such key genes could be tracked, it would be possible to predict the time of onset of dementia and significantly aid in developing drugs to treat it. However, gene finding involves tremendous cost, time and effort. To alleviate these problems, research on utilizing computational biology to reduce the search space of candidate genes is being actively conducted. In this study, we propose a framework in which diseases, genes and single-nucleotide polymorphisms are represented by a layered network, and key genes are predicted by a machine learning algorithm. The algorithm utilizes a network-based semi-supervised learning model that can be applied to layered data structures. RESULTS: The proposed method was applied to a dataset extracted from public databases related to diseases and genes, with data collected from 186 patients. A portion of the key genes obtained using the proposed method was verified in silico through the PubMed literature, and the remaining genes were retained as candidate genes. AVAILABILITY AND IMPLEMENTATION: The code for the framework will be available at http://www.alphaminers.net/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘http://www.alphaminers.net/’]

Determining minimum set of driver nodes in protein-protein interaction networks.

BACKGROUND: Recently, several studies have drawn attention to the determination of a minimum set of driver proteins that are important for the control of the underlying protein-protein interaction (PPI) networks. In general, the minimum dominating set (MDS) model is widely adopted. However, because the MDS model does not generate a unique MDS configuration, multiple different MDSs are generated when different optimization algorithms are used. Therefore, among these MDSs, it is difficult to determine which one represents the true set of driver proteins. RESULTS: To address this problem, we develop a centrality-corrected minimum dominating set (CC-MDS) model which incorporates heterogeneity in the degree and betweenness centralities of proteins. Both the MDS model and the CC-MDS model are applied to three human PPI networks. Unlike the MDS model, the CC-MDS model generates almost the same sets of driver proteins when implemented with different optimization algorithms. The CC-MDS model targets more high-degree and high-betweenness proteins than the uncorrected counterpart. The more central position allows CC-MDS proteins to be more important in maintaining overall network connectivity than MDS proteins. As for functional significance, we find that CC-MDS proteins are involved in, on average, more protein complexes and GO annotations than MDS proteins. We also find that more essential genes, aging genes, disease-associated genes and virus-targeted genes appear among CC-MDS proteins than among MDS proteins. As for involvement in regulatory functions, the sets of CC-MDS proteins show much stronger enrichment of transcription factors and protein kinases. The results on topological and functional significance demonstrate that the CC-MDS model can capture more driver proteins than the MDS model. CONCLUSIONS: Based on the results obtained, the CC-MDS model is a powerful tool for the determination of driver proteins that control the underlying PPI networks. The software described in this paper and the datasets used are available at https://github.com/Zhangxf-ccnu/CC-MDS.

URL: [‘https://github.com/Zhangxf-ccnu/CC-MDS’]
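
The CC-MDS model biases the dominating-set solution toward proteins with high degree and betweenness centrality. A greedy, centrality-weighted approximation of that idea is sketched below with networkx; it is not the authors' exact optimization procedure, only an illustration of how the two ingredients (domination and centrality) can be combined.

```python
# Greedy, centrality-weighted dominating-set approximation (illustration only;
# the CC-MDS paper formulates and solves the problem differently).
import networkx as nx

def centrality_weighted_mds(G, alpha=0.5):
    deg = nx.degree_centrality(G)
    btw = nx.betweenness_centrality(G)
    score = {v: alpha * deg[v] + (1 - alpha) * btw[v] for v in G}
    undominated = set(G.nodes)
    drivers = set()
    while undominated:
        # pick the node covering the most undominated nodes, breaking ties by centrality
        best = max(G.nodes - drivers,
                   key=lambda v: (len(({v} | set(G[v])) & undominated), score[v]))
        drivers.add(best)
        undominated -= {best} | set(G[best])
    return drivers

G = nx.karate_club_graph()           # toy network standing in for a PPI network
print(sorted(centrality_weighted_mds(G)))
```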

Easy-HLA: a validated web application suite to reveal the full details of HLA typing.

MOTIVATION: The HLA system plays a pivotal role in both clinical applications and immunology research. Typing HLA genes in patients and donors is indeed required in hematopoietic stem cell and solid-organ transplantation, and the histocompatibility complex region exhibits countless genetic associations with immune-related pathologies. Since the discovery of HLA antigens, the HLA system nomenclature and typing methods have constantly evolved, which leads to difficulties in using data generated with older methodologies. RESULTS: Here, we present Easy-HLA, a web-based software suite designed to facilitate analysis and gain knowledge from HLA typing, regardless of nomenclature or typing method. Easy-HLA implements a computational and statistical method of HLA haplotype inference based on published reference populations containing over 600 000 haplotypes to upgrade missing or partial HLA information: the ‘HLA-Upgrade’ tool infers high-resolution HLA typing and ‘HLA-2-Haplo’ imputes haplotype pairs and provides additional functional annotations (e.g. amino acids and KIR ligands). We validated both tools using two independent cohorts (total n = 2500). For HLA-Upgrade, we reached a prediction accuracy of 92% from low- to high-resolution European genotypes. We observed a 96% call rate and 76% accuracy for HLA-2-Haplo European haplotype pair predictions. In conclusion, Easy-HLA tools facilitate large-scale immunogenetic analysis and promote multi-faceted HLA expertise beyond allelic associations by providing new functional immunogenomics parameters. AVAILABILITY AND IMPLEMENTATION: Easy-HLA is a web application freely available (free account) at: https://hla.univ-nantes.fr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://hla.univ-nantes.fr’]

Self-supervised learning of neighborhood embedding for longitudinal MRI.

In recent years, several deep learning models have recommended first representing Magnetic Resonance Imaging (MRI) data as latent features before performing a downstream task of interest (such as classification or regression). The performance of the downstream task generally improves when these latent representations are explicitly associated with factors of interest. For example, we derived such a representation for capturing brain aging by applying self-supervised learning to longitudinal MRIs and then used the resulting encoding to automatically identify diseases accelerating the aging of the brain. We now propose a refinement of this representation by replacing the linear modeling of brain aging with one that is consistent in local neighborhoods in the latent space. Called Longitudinal Neighborhood Embedding (LNE), we derive an encoding so that neighborhoods are age-consistent (i.e., brain MRIs of different subjects with similar brain ages are in close proximity to each other) and progression-consistent, i.e., the latent space is defined by a smooth trajectory field where each trajectory captures changes in brain age between a pair of MRIs extracted from a longitudinal sequence. To make the problem computationally tractable, we further propose a strategy for mini-batch sampling so that the resulting local neighborhoods accurately approximate the ones that would be defined based on the whole cohort. We evaluate LNE on three different downstream tasks: (1) to predict chronological age from T1-w MRI of 274 healthy subjects participating in a study at SRI International; (2) to distinguish Normal Control (NC) from Alzheimer’s Disease (AD) and stable Mild Cognitive Impairment (sMCI) from progressive Mild Cognitive Impairment (pMCI) based on T1-w MRI of 632 participants of the Alzheimer’s Disease Neuroimaging Initiative (ADNI); and (3) to distinguish no-to-low from moderate-to-heavy alcohol drinkers based on fractional anisotropy derived from diffusion tensor MRIs of 764 adolescents recruited by the National Consortium on Alcohol and NeuroDevelopment in Adolescence (NCANDA). Across the three data sets, the visualization of the smooth trajectory vector fields and superior accuracy on downstream tasks demonstrate the strength of the proposed method over existing self-supervised methods in extracting information related to brain aging, which could help study the impact of substance use and neurodegenerative disorders. The code is available at https://github.com/ouyangjiahong/longitudinal-neighbourhood-embedding.

URL: [‘https://github.com/ouyangjiahong/longitudinal-neighbourhood-embedding’]

Multi-scale semi-supervised clustering of brain images: Deriving disease subtypes.

Disease heterogeneity is a significant obstacle to understanding pathological processes and delivering precision diagnostics and treatment. Clustering methods have gained popularity for stratifying patients into subpopulations (i.e., subtypes) of brain diseases using imaging data. However, unsupervised clustering approaches are often confounded by anatomical and functional variations not related to a disease or pathology of interest. Semi-supervised clustering techniques have been proposed to overcome this and, therefore, capture disease-specific patterns more effectively. An additional limitation of both unsupervised and semi-supervised conventional machine learning methods is that they typically model, learn and infer from data using a basis of feature sets pre-defined at a fixed anatomical or functional scale (e.g., atlas-based regions of interest). Herein we propose a novel method, “Multi-scAle heteroGeneity analysIs and Clustering” (MAGIC), to depict the multi-scale presentation of disease heterogeneity, which builds on a previously proposed semi-supervised clustering method, HYDRA. It derives multi-scale and clinically interpretable feature representations and exploits a double-cyclic optimization procedure to effectively drive identification of inter-scale-consistent disease subtypes. More importantly, to understand the conditions under which the clustering model can estimate true heterogeneity related to diseases, we conducted extensive and systematic semi-simulated experiments to evaluate the proposed method on a sizeable healthy control sample from the UK Biobank (N = 4403). We then applied MAGIC to imaging data from Alzheimer’s disease (ADNI, N = 1728) and schizophrenia (PHENOM, N = 1166) patients to demonstrate its potential and challenges in dissecting the neuroanatomical heterogeneity of common brain diseases. Taken together, we aim to provide guidance regarding when such analyses can succeed or should be taken with caution. The code of the proposed method is publicly available at https://github.com/anbai106/MAGIC.

URL: [‘https://github.com/anbai106/MAGIC’]

Multilayer meta-matching: Translating phenotypic prediction models from multiple datasets to small data.

Resting-state functional connectivity (RSFC) is widely used to predict phenotypic traits in individuals. Large sample sizes can significantly improve prediction accuracies. However, for studies of certain clinical populations or focused neuroscience inquiries, small-scale datasets often remain a necessity. We have previously proposed a “meta-matching” approach to translate prediction models from large datasets to predict new phenotypes in small datasets. We demonstrated a large improvement over classical kernel ridge regression (KRR) when translating models from a single source dataset (UK Biobank) to the Human Connectome Project Young Adults (HCP-YA) dataset. In the current study, we propose two meta-matching variants (“meta-matching with dataset stacking” and “multilayer meta-matching”) to translate models from multiple source datasets across disparate sample sizes to predict new phenotypes in small target datasets. We evaluate both approaches by translating models trained on five source datasets (with sample sizes ranging from 862 participants to 36,834 participants) to predict phenotypes in the HCP-YA and HCP-Aging datasets. We find that multilayer meta-matching modestly outperforms meta-matching with dataset stacking. Both meta-matching variants perform better than the original “meta-matching with stacking” approach trained only on the UK Biobank. All meta-matching variants outperform classical KRR and transfer learning by a large margin. In fact, KRR is better than classical transfer learning when fewer than 50 participants are available for finetuning, suggesting the difficulty of classical transfer learning in the very small sample regime. The multilayer meta-matching model is publicly available at https://github.com/ThomasYeoLab/Meta_matching_models/tree/main/rs-fMRI/v2.0.

URL: [‘https://github.com/ThomasYeoLab/Meta_matching_models/tree/main/rs-fMRI/v2.0’]
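
The meta-matching idea translates models trained on large source datasets to a small target dataset by stacking their predictions as features. A bare-bones illustration with kernel ridge regression from scikit-learn is given below on synthetic data; the released multilayer meta-matching models are deep networks, so this is only a conceptual sketch, not the authors' pipeline.

```python
# Bare-bones "stacking" flavor of meta-matching on synthetic data (illustrative only;
# the released models are deep networks, see the repository above).
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
# hypothetical RSFC features: a large source dataset with many phenotypes, a small target set
X_src, Y_src = rng.standard_normal((2000, 400)), rng.standard_normal((2000, 10))
X_tgt, y_tgt = rng.standard_normal((60, 400)), rng.standard_normal(60)

# 1) train one source model per source phenotype
source_models = [KernelRidge(kernel="rbf", alpha=1.0).fit(X_src, Y_src[:, k])
                 for k in range(Y_src.shape[1])]
# 2) source-model predictions on the small target sample become stacked features
Z_tgt = np.column_stack([m.predict(X_tgt) for m in source_models])
# 3) a simple second-level model maps stacked predictions to the new phenotype
stacker = RidgeCV().fit(Z_tgt, y_tgt)
print(stacker.score(Z_tgt, y_tgt))
```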

Predicting Explainable Dementia Types with LLM-aided Feature Engineering.

MOTIVATION: The integration of Machine Learning (ML) and Artificial Intelligence (AI) into healthcare has immense potential due to the rapidly growing volume of clinical data. However, existing AI models, particularly Large Language Models (LLMs) like GPT-4, face significant challenges in terms of explainability and reliability, particularly in high-stakes domains like healthcare. RESULTS: This paper proposes a novel LLM-aided feature engineering approach that enhances interpretability by extracting clinically relevant features from the Oxford Textbook of Medicine. By converting clinical notes into concept vector representations and employing a linear classifier, our method achieved an accuracy of 0.72, outperforming a traditional n-gram Logistic Regression baseline (0.64) and the GPT-4 baseline (0.48), while focusing on high-level clinical features. We also explore using text embeddings to reduce the overall time and cost of our approach by 97%. AVAILABILITY: All code relevant to this paper is available at: https://github.com/AdityaKashyap423/Dementia_LLM_Feature_Engineering/tree/main. SUPPLEMENTARY INFORMATION: Supplementary PDF and other data files can be found at https://drive.google.com/drive/folders/1UqdpsKFnvGjUJgp58k3RYcJ8zN8zPmWR?usp=share_link.

URL: [‘https://github.com/AdityaKashyap423/Dementia_LLM_Feature_Engineering/tree/main’, ‘https://drive.google.com/drive/folders/1UqdpsKFnvGjUJgp58k3RYcJ8zN8zPmWR?usp=share_link’]
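
The approach converts each clinical note into a vector over clinically meaningful concepts and then fits a linear classifier. A toy version of that pipeline, with made-up concept terms and simple keyword matching standing in for LLM-based extraction, is sketched below; it only illustrates the overall shape of the method.

```python
# Toy concept-vector pipeline (made-up concept terms, keyword matching instead of
# LLM extraction) to illustrate the overall approach.
import numpy as np
from sklearn.linear_model import LogisticRegression

CONCEPTS = ["memory loss", "hallucination", "tremor", "fluctuating cognition"]  # hypothetical

def note_to_concept_vector(note: str) -> np.ndarray:
    note = note.lower()
    return np.array([float(c in note) for c in CONCEPTS])

notes = ["Progressive memory loss over two years.",
         "Visual hallucination and fluctuating cognition reported.",
         "Resting tremor, no memory complaints."]
labels = [0, 1, 2]                                   # hypothetical dementia-type labels

X = np.vstack([note_to_concept_vector(n) for n in notes])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```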

Identifying and ranking potential driver genes of Alzheimer’s disease using multiview evidence aggregation.

MOTIVATION: Late-onset Alzheimer’s disease is currently a disease with no known effective treatment options. To better understand the disease, new multi-omic datasets have recently been generated with the goal of identifying its molecular causes. However, most analytic studies using these datasets focus on uni-modal analysis of the data. Here, we propose a data-driven approach to integrate multiple data types and analytic outcomes in order to aggregate evidence supporting the hypothesis that a gene is a genetic driver of the disease. The main algorithmic contributions of our article are: (i) a general machine learning framework to learn the key characteristics of a few known driver genes from multiple feature sets and identify other potential driver genes with similar feature representations, and (ii) a flexible ranking scheme with the ability to integrate external validation in the form of Genome Wide Association Study summary statistics. While we currently focus on demonstrating the effectiveness of the approach using different analytic outcomes from RNA-Seq studies, this method is easily generalizable to other data modalities and analysis types. RESULTS: We demonstrate the utility of our machine learning algorithm on two benchmark multiview datasets by significantly outperforming the baseline approaches in predicting missing labels. We then use the algorithm to predict and rank potential drivers of Alzheimer’s disease. We show that our ranked genes show a significant enrichment for single nucleotide polymorphisms associated with Alzheimer’s disease and are enriched in pathways that have been previously associated with the disease. AVAILABILITY AND IMPLEMENTATION: Source code and links to all feature sets are available at https://github.com/Sage-Bionetworks/EvidenceAggregatedDriverRanking.

URL: [‘https://github.com/Sage-Bionetworks/EvidenceAggregatedDriverRanking’]

dsRID: in silico identification of dsRNA regions using long-read RNA-seq data.

MOTIVATION: Double-stranded RNAs (dsRNAs) are potent triggers of innate immune responses upon recognition by cytosolic dsRNA sensor proteins. Identification of endogenous dsRNAs helps to better understand the dsRNAome and its relevance to innate immunity related to human diseases. RESULTS: Here, we report dsRID (double-stranded RNA identifier), a machine learning-based method to predict dsRNA regions in silico, leveraging the power of long-read RNA-sequencing (RNA-seq) and molecular traits of dsRNAs. Using models trained with PacBio long-read RNA-seq data derived from Alzheimer’s disease (AD) brain, we show that our approach is highly accurate in predicting dsRNA regions in multiple data sets. Applied to an AD cohort sequenced by the ENCODE consortium, we characterize the global dsRNA profile with potentially distinct expression patterns between AD and controls. Together, we show that dsRID provides an effective approach to capture global dsRNA profiles using long-read RNA-seq data. AVAILABILITY: Software implementation of dsRID, and genomic coordinates of regions predicted by dsRID in all samples are available at the GitHub repository: https://github.com/gxiaolab/dsRID. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/gxiaolab/dsRID’]

pyaging: a Python-based compendium of GPU-optimized aging clocks.

MOTIVATION: Aging is intricately linked to diseases and mortality. It is reflected in molecular changes across various tissues, which can be leveraged for the development of biomarkers of aging using machine learning models, known as aging clocks. Despite advancements in the field, a significant challenge remains: the lack of robust, Python-based software tools for integrating and comparing these diverse models. This gap highlights the need for comprehensive solutions that can handle the complexity and variety of data in aging research. RESULTS: To address this gap, I introduce pyaging, a comprehensive open-source Python package designed to facilitate aging research. pyaging harmonizes dozens of aging clocks, covering a range of molecular data types such as DNA methylation, transcriptomics, histone mark ChIP-Seq, and ATAC-Seq. The package is not limited to traditional model types; it features a diverse array, from linear and principal component models to neural networks and automatic relevance determination models. Thanks to a PyTorch-based backend that enables GPU acceleration, pyaging is capable of rapid inference, even when dealing with large datasets and complex models. Additionally, the package’s support for multi-species analysis extends its utility across various organisms, including humans, other mammals, and C. elegans. AVAILABILITY AND IMPLEMENTATION: pyaging is accessible on GitHub, at https://github.com/rsinghlab/pyaging, and the distribution is available on PyPI, at https://pypi.org/project/pyaging/. The software is also archived on Zenodo, at https://zenodo.org/doi/10.5281/zenodo.10335011. SUPPLEMENTARY INFORMATION: Supplementary materials, including detailed documentation and usage examples, can be found online at the pyaging documentation site https://pyaging.readthedocs.io/en/latest/index.html.

URL: [‘https://github.com/rsinghlab/pyaging’, ‘https://pypi.org/project/pyaging/’, ‘https://zenodo.org/doi/10.5281/zenodo.10335011’, ‘https://pyaging.readthedocs.io/en/latest/index.html’]
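
Many first-generation aging clocks are, at their core, penalized linear models over CpG methylation beta values. The snippet below shows that generic computation with made-up probe identifiers and coefficients; it is not the pyaging API, whose documented interface should be consulted at the links above.

```python
# Generic linear "clock" computation with made-up probe IDs and coefficients;
# not the pyaging API (see the documentation links above for the actual interface).
import numpy as np

coefficients = {"cg00000029": 0.5, "cg00001249": -1.2, "cg00002426": 2.1}  # hypothetical
intercept = 45.0                                                           # hypothetical

def predict_age(betas: dict) -> float:
    """betas: CpG id -> methylation beta value in [0, 1]."""
    return intercept + sum(w * betas.get(cpg, 0.0) for cpg, w in coefficients.items())

print(predict_age({"cg00000029": 0.8, "cg00001249": 0.3, "cg00002426": 0.6}))
```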

ReRF-Pred: predicting amyloidogenic regions of proteins based on their pseudo amino acid composition and tripeptide composition.

BACKGROUND: Amyloids are insoluble fibrillar aggregates that are highly associated with complex human diseases, such as Alzheimer’s disease, Parkinson’s disease, and type II diabetes. Recently, many studies reported that specific regions of amino acid sequences may be responsible for the amyloidosis of proteins. Identifying amyloidogenic regions has therefore become very important for elucidating the mechanism of amyloid formation. Accordingly, several computational methods have been put forward to discover amyloidogenic regions. The majority of these methods predict amyloidogenic regions based on the physicochemical properties of amino acids. In fact, the position, order, and correlation of amino acids may also influence the amyloidosis of proteins, and these should also be considered in detecting amyloidogenic regions. RESULTS: To address this problem, we propose a novel machine-learning approach for predicting amyloidogenic regions, called ReRF-Pred. Firstly, the pseudo amino acid composition (PseAAC) was exploited to characterize the physicochemical properties and correlation of amino acids. Secondly, tripeptide composition (TPC) was employed to represent the order and position of amino acids. To improve the distinguishability of TPC, all possible tripeptides were analyzed with the binomial distribution method, and only those with significantly different distributions between positive and negative samples were retained. Finally, all samples were characterized by the PseAAC and TPC of their amino acid sequences, and a random forest-based amyloidogenic region predictor was trained on these samples. Validation experiments showed that the feature set consisting of PseAAC and TPC is the most discriminative for detecting amyloidosis. Meanwhile, random forest is superior to the other classifiers considered on almost all metrics. To validate the effectiveness of our model, ReRF-Pred is compared with a series of gold-standard methods on two datasets: Pep-251 and Reg33. The results suggest our method has the best overall performance and makes significant improvements in discovering amyloidogenic regions. CONCLUSIONS: The advantages of our method are mainly attributed to the fact that PseAAC and TPC successfully describe the differences between amyloids and other proteins. The ReRF-Pred server can be accessed at http://106.12.83.135:8080/ReRF-Pred/.

URL: [‘http://106.12.83.135:8080/ReRF-Pred/’]
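
ReRF-Pred couples pseudo amino acid composition with tripeptide composition and trains a random forest. The tripeptide-composition features and a random-forest fit are simple to sketch, as below with toy sequences; the PseAAC features and the binomial tripeptide screening used in the paper are omitted from this illustration.

```python
# Tripeptide composition (TPC) features plus a random forest, as a simplified
# stand-in for ReRF-Pred (PseAAC features and the binomial screening are omitted).
from itertools import product
import numpy as np
from sklearn.ensemble import RandomForestClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]   # 8000 features
INDEX = {t: i for i, t in enumerate(TRIPEPTIDES)}

def tpc(seq: str) -> np.ndarray:
    v = np.zeros(len(TRIPEPTIDES))
    for i in range(len(seq) - 2):
        v[INDEX[seq[i:i + 3]]] += 1
    return v / max(len(seq) - 2, 1)

# toy training data: (sequence, is_amyloidogenic) pairs, labels are hypothetical
seqs = ["KLVFFAE", "GSGSGSGS", "NNQQNY", "AAAAAAKKK"]
y = [1, 0, 1, 0]
X = np.vstack([tpc(s) for s in seqs])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(clf.predict_proba(X)[:, 1])
```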

SliceMap: an algorithm for automated brain region annotation.

Summary: Many neurodegenerative disorders, such as Alzheimer’s Disease, pertain to or spread from specific sites of the brain. Hence, accurate disease staging or therapy assessment in transgenic model mice demands automated analysis of selected brain regions. To address this need, we have developed an algorithm, termed SliceMap, that enables contextual quantification by mapping anatomical information onto microtome-cut brain slices. For every newly acquired high-resolution image of a brain slice, the algorithm performs a coarse congealing-based registration to a library of pre-annotated reference slices. A subset of optimally matching reference slices is then used for refined, elastic registration. Morphotextural metrics are used to measure registration performance and to automatically detect poorly cut slices. We have implemented our method as a plugin for FIJI image analysis freeware, and we have used it to regionally quantify tau pathology in brain slices from a tauopathy (P301S) mouse model. By enabling region-based quantification, our method contributes to a more accurate assessment of neurodegenerative disease development. Availability and implementation: The method is available as a plugin for FIJI from https://github.com/mbarbie1/SliceMap/, along with an example dataset and user instructions. Contact: winnok.devos@uantwerpen.be. Supplementary information: Supplementary data are available at Bioinformatics online.

URL: [‘https://github.com/mbarbie1/SliceMap/’]

CarSite-II: an integrated classification algorithm for identifying carbonylated sites based on K-means similarity-based undersampling and synthetic minority oversampling techniques.

BACKGROUND: Carbonylation is a non-enzymatic, irreversible protein post-translational modification in which the side chains of amino acid residues are attacked by reactive oxygen species and finally converted into carbonyl products. Studies have shown that protein carbonylation caused by reactive oxygen species is involved in the etiology and pathophysiological processes of aging, neurodegenerative diseases, inflammation, diabetes, amyotrophic lateral sclerosis, Huntington’s disease, and tumors. Current experimental approaches used to identify carbonylation sites are expensive, time-consuming, and limited in protein processing abilities. Computational prediction of the locations of carbonylated residues enhances the functional characterization of proteins. RESULTS: In this study, an integrated classifier algorithm, CarSite-II, was developed to identify K, P, R, and T carbonylated sites. The resampling method SMOTE-KSU, which combines K-means similarity-based undersampling with the synthetic minority oversampling technique, was incorporated to balance the proportions of K, P, R, and T carbonylated training samples. Next, the integrated Rotation Forest classifier system, with support vector machine sub-classifiers, divides the three types of feature spaces into several subsets. CarSite-II achieved Matthews correlation coefficient (MCC) values of 0.2287/0.3125/0.2787/0.2814, false positive rates of 0.2628/0.1084/0.1383/0.1313, and false negative rates of 0.2252/0.0205/0.0976/0.0608 for K/P/R/T carbonylation sites by tenfold cross-validation, respectively. On our independent test dataset, CarSite-II yielded MCC values of 0.6358/0.2910/0.4629/0.3685, false positive rates of 0.0165/0.0203/0.0188/0.0094, and false negative rates of 0.1026/0.1875/0.2037/0.3333 for K/P/R/T carbonylation sites. The results show that CarSite-II achieves remarkably better performance than all currently available prediction tools. CONCLUSION: These results revealed that CarSite-II achieved better performance than the five currently available programs and demonstrated the usefulness of the SMOTE-KSU resampling approach and the integration algorithm. For the convenience of experimental scientists, the web tool of CarSite-II is available at http://47.100.136.41:8081/.

URL: [‘http://47.100.136.41:8081/’]
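
CarSite-II balances the carbonylation training data with a K-means similarity-based undersampling plus SMOTE scheme (SMOTE-KSU). The widely used imbalanced-learn building block for the oversampling half of that kind of resampling looks like the sketch below on synthetic data; the paper's customized KSU undersampling and Rotation Forest ensemble are not reproduced here.

```python
# Standard SMOTE oversampling with imbalanced-learn plus an SVM, as generic building
# blocks; the paper's K-means similarity-based undersampling (KSU) and Rotation Forest
# ensemble are not reproduced here.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE

# toy imbalanced data standing in for carbonylation-site feature vectors
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # balance the minority class
clf = SVC(probability=True).fit(X_res, y_res)
print(clf.score(X, y))
```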

AAgMarker 1.0: a resource of serological autoantigen biomarkers for clinical diagnosis and prognosis of various human diseases.

Autoantibodies are produced to target an individual’s own antigens (e.g. proteins). They can trigger autoimmune responses and inflammation and thus cause many types of diseases. Many high-throughput autoantibody profiling projects have been reported for the unbiased identification of serological autoantigen-based biomarkers. However, the lack of a centralized data portal for these published assays has been a major obstacle to further data mining and to cross-evaluating the quality of datasets generated from different diseases. Here, we introduce a user-friendly database, AAgMarker 1.0, which collects published raw datasets obtained from serum profiling assays on proteome microarrays and provides a toolbox for mining these data. The current version of AAgMarker 1.0 contains 854 serum samples, involving 136 092 proteins. A total of 7803 (4470 non-redundant) candidate autoantigen biomarkers were identified and collected for 12 diseases, such as Alzheimer’s disease, Behçet’s disease and Parkinson’s disease. Seven statistical parameters are introduced to quantitatively assess these biomarkers. Users can retrieve, analyse and compare the datasets through basic search, advanced search and browsing. These biomarkers are also downloadable by disease term. AAgMarker 1.0 is now freely accessible at http://bioinfo.wilmer.jhu.edu/AAgMarker/. We believe this database will be a valuable resource for the community of both biomedical and clinical research.

URL: [‘http://bioinfo.wilmer.jhu.edu/AAgMarker/’]

Disentangling Normal Aging From Severity of Disease via Weak Supervision on Longitudinal MRI.

The continuous progression of neurological diseases is often categorized into conditions according to severity. To relate severity to changes in brain morphometry, there is a growing interest in replacing these categories with a continuous severity scale onto which longitudinal MRIs are mapped via deep learning algorithms. However, existing methods based on supervised learning require large numbers of samples, and those that do not, such as self-supervised models, fail to clearly separate the disease effect from normal aging. Here, we propose to explicitly disentangle those two factors via weak supervision. In other words, training is based on longitudinal MRIs being labelled either normal or diseased so that the training data can be augmented with samples from disease categories that are not of primary interest to the analysis. We do so by encouraging trajectories of controls to be fully encoded by the direction associated with brain aging. Furthermore, an orthogonal direction linked to disease severity captures the residual component from normal aging in the diseased cohort. Hence, the proposed method quantifies disease severity and its progression speed in individuals without knowing their condition. We apply the proposed method to data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI, N = 632). We then show that the model properly disentangled normal aging from the severity of cognitive impairment by plotting the resulting disentangled factors of each subject and generating simulated MRIs for a given chronological age and condition. Moreover, our representation obtains higher balanced accuracy when used for two downstream classification tasks compared to other pre-training approaches. The code for our weakly supervised approach is available at https://github.com/ouyangjiahong/longitudinal-direction-disentangle.

URL: [‘https://github.com/ouyangjiahong/longitudinal-direction-disentangle’]
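
The key operation is decomposing a subject's longitudinal latent-space trajectory into a component along an aging direction and a component along an orthogonal disease-severity direction. That decomposition itself is a few lines of linear algebra, sketched below with random stand-in vectors; learning the directions and the encoder is where the weak supervision comes in and is not shown.

```python
# Projecting a latent trajectory onto an aging direction and an orthogonal severity
# direction (the directions and the encoder would be learned; fixed here for illustration).
import numpy as np

rng = np.random.default_rng(0)
d_age = rng.standard_normal(64); d_age /= np.linalg.norm(d_age)
d_dis = rng.standard_normal(64)
d_dis -= (d_dis @ d_age) * d_age                     # make the disease direction orthogonal
d_dis /= np.linalg.norm(d_dis)

z_t0, z_t1 = rng.standard_normal(64), rng.standard_normal(64)   # latent codes of two visits
trajectory = z_t1 - z_t0
aging_component = trajectory @ d_age                  # normal-aging speed
severity_component = trajectory @ d_dis               # residual attributed to disease severity
print(aging_component, severity_component)
```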

Synthesizing individualized aging brains in health and disease with generative models and parallel transport.

Simulating prospective magnetic resonance imaging (MRI) scans from a given individual brain image is challenging, as it requires accounting for canonical changes in aging and/or disease progression while also considering the individual brain’s current status and unique characteristics. While current deep generative models can produce high-resolution anatomically accurate templates for population-wide studies, their ability to predict future aging trajectories for individuals remains limited, particularly in capturing subject-specific neuroanatomical variations over time. In this study, we introduce Individualized Brain Synthesis (InBrainSyn), a framework for synthesizing high-resolution subject-specific longitudinal MRI scans that simulate neurodegeneration in both Alzheimer’s disease (AD) and normal aging. InBrainSyn uses a parallel transport algorithm to adapt the population-level aging trajectories learned by a generative deep template network, enabling individualized aging synthesis. As InBrainSyn uses diffeomorphic transformations to simulate aging, the synthesized images are topologically consistent with the original anatomy by design. We evaluated InBrainSyn both quantitatively and qualitatively on AD and healthy control cohorts from the Open Access Series of Imaging Studies - version 3 dataset. Experimentally, InBrainSyn can also model neuroanatomical transitions between normal aging and AD. An evaluation of an external set supports its generalizability. Overall, with only a single baseline scan, InBrainSyn synthesizes realistic 3D spatiotemporal T1w MRI scans, producing personalized longitudinal aging trajectories. The code for InBrainSyn is available at https://github.com/Fjr9516/InBrainSyn.

URL: [‘https://github.com/Fjr9516/InBrainSyn’]

Brain Latent Progression: Individual-based spatiotemporal disease progression on 3D Brain MRIs via latent diffusion.

The growing availability of longitudinal Magnetic Resonance Imaging (MRI) datasets has facilitated Artificial Intelligence (AI)-driven modeling of disease progression, making it possible to predict future medical scans for individual patients. However, despite significant advancements in AI, current methods continue to face challenges including achieving patient-specific individualization, ensuring spatiotemporal consistency, efficiently utilizing longitudinal data, and managing the substantial memory demands of 3D scans. To address these challenges, we propose Brain Latent Progression (BrLP), a novel spatiotemporal model designed to predict individual-level disease progression in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates in a small latent space, mitigating the computational challenges posed by high-dimensional imaging data; (ii) it explicitly integrates subject metadata to enhance the individualization of predictions; (iii) it incorporates prior knowledge of disease dynamics through an auxiliary model, facilitating the integration of longitudinal data; and (iv) it introduces the Latent Average Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in the predicted progression at inference time and (b) allows us to derive a measure of the uncertainty for the prediction at the global and voxel level. We train and evaluate BrLP on 11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its generalizability on an external test set comprising 2,257 MRIs from 962 subjects. Our experiments compare BrLP-generated MRI scans with real follow-up MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The code is publicly available at: https://github.com/LemuelPuglisi/BrLP.

URL: [‘https://github.com/LemuelPuglisi/BrLP’]
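
Latent Average Stabilization (LAS) averages several stochastic latent predictions of the same future scan and uses their spread as an uncertainty estimate. Stripped of the latent diffusion model and the decoder, the averaging step can be sketched as below; the sampler here is just a random placeholder.

```python
# Sketch of the averaging step behind Latent Average Stabilization (LAS): average
# several stochastic latent predictions and use their spread as uncertainty. The
# diffusion sampler and decoder are placeholders here.
import numpy as np

def las_predict(sample_latent_progression, n_samples=8):
    """sample_latent_progression(): one stochastic latent prediction (e.g., a diffusion sample)."""
    samples = np.stack([sample_latent_progression() for _ in range(n_samples)])
    mean_latent = samples.mean(axis=0)          # stabilized prediction fed to the decoder
    uncertainty = samples.var(axis=0)           # per-latent-voxel spread
    return mean_latent, uncertainty

rng = np.random.default_rng(0)
mean_latent, uncertainty = las_predict(lambda: rng.standard_normal((4, 16, 16, 16)))
print(mean_latent.shape, float(uncertainty.mean()))
```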

PPAD: a deep learning architecture to predict progression of Alzheimer’s disease.

MOTIVATION: Alzheimer’s disease (AD) is a neurodegenerative disease that affects millions of people worldwide. Mild cognitive impairment (MCI) is an intermediary stage between the cognitively normal state and AD. Not all people who have MCI convert to AD. The diagnosis of AD is made after significant symptoms of dementia such as short-term memory loss are already present. Since AD is currently an irreversible disease, diagnosis at the onset of the disease brings a huge burden on patients, their caregivers, and the healthcare sector. Thus, there is a crucial need to develop methods for the early prediction of AD in patients who have MCI. Recurrent neural networks (RNN) have been successfully used to handle electronic health records (EHR) for predicting conversion from MCI to AD. However, RNNs ignore the irregular time intervals between successive events, which are common in electronic health record data. In this study, we propose two deep learning architectures based on RNNs, namely Predicting Progression of Alzheimer’s Disease (PPAD) and PPAD-Autoencoder. PPAD and PPAD-Autoencoder are designed for early prediction of conversion from MCI to AD at the next visit and at multiple visits ahead, respectively. To minimize the effect of the irregular time intervals between visits, we propose using the age at each visit as an indicator of the time change between successive visits. RESULTS: Our experimental results on the Alzheimer’s Disease Neuroimaging Initiative and National Alzheimer’s Coordinating Center datasets showed that our proposed models outperformed all baseline models for most prediction scenarios in terms of F2 and sensitivity. We also observed that the age feature was one of the top features and was able to address the irregular time interval problem. AVAILABILITY AND IMPLEMENTATION: https://github.com/bozdaglab/PPAD.

URL: [‘https://github.com/bozdaglab/PPAD’]
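
PPAD feeds visit-level features into an RNN and uses the patient's age at each visit to stand in for the irregular time gaps between visits. A minimal PyTorch sketch of that input construction and an LSTM classifier is given below; the dimensions, architecture details, and the autoencoder variant are placeholders rather than the released model.

```python
# Minimal sketch of an RNN over visit sequences with age appended per visit as a
# proxy for irregular inter-visit intervals (dimensions are placeholders; the
# PPAD-Autoencoder variant is not shown).
import torch
import torch.nn as nn

class VisitRNN(nn.Module):
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(n_features + 1, hidden, batch_first=True)  # +1 for age
        self.head = nn.Linear(hidden, 1)                              # MCI-to-AD conversion logit

    def forward(self, visits, ages):
        x = torch.cat([visits, ages.unsqueeze(-1)], dim=-1)           # append age to each visit
        _, (h, _) = self.rnn(x)
        return self.head(h[-1])

model = VisitRNN(n_features=10)
visits = torch.randn(4, 5, 10)                 # batch of 4 patients, 5 visits, 10 features
ages = torch.tensor([[70., 70.5, 71., 71.5, 72.]]).repeat(4, 1)
print(torch.sigmoid(model(visits, ages)).shape)
```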

TA-RNN: an attention-based time-aware recurrent neural network architecture for electronic health records.

MOTIVATION: Electronic health records (EHRs) represent a comprehensive resource of a patient’s medical history. EHRs are essential for utilizing advanced technologies such as deep learning (DL), enabling healthcare providers to analyze extensive data, extract valuable insights, and make precise and data-driven clinical decisions. DL methods such as recurrent neural networks (RNN) have been utilized to analyze EHRs to model disease progression and predict diagnosis. However, these methods do not address some inherent irregularities in EHR data such as irregular time intervals between clinical visits. Furthermore, most DL models are not interpretable. In this study, we propose two interpretable DL architectures based on RNNs, namely time-aware RNN (TA-RNN) and TA-RNN-autoencoder (TA-RNN-AE), to predict a patient’s clinical outcome in EHRs at the next visit and at multiple visits ahead, respectively. To mitigate the impact of irregular time intervals, we propose incorporating a time embedding of the elapsed times between visits. For interpretability, we propose employing a dual-level attention mechanism that operates between visits and between features within each visit. RESULTS: The results of the experiments conducted on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and National Alzheimer’s Coordinating Center (NACC) datasets indicated the superior performance of the proposed models for predicting Alzheimer’s disease (AD) compared to state-of-the-art and baseline approaches based on F2 and sensitivity. Additionally, TA-RNN showed superior performance on the Medical Information Mart for Intensive Care (MIMIC-III) dataset for mortality prediction. In our ablation study, we observed enhanced predictive performance by incorporating the time embedding and attention mechanisms. Finally, investigating attention weights helped identify influential visits and features in predictions. AVAILABILITY AND IMPLEMENTATION: https://github.com/bozdaglab/TA-RNN.

URL: [‘https://github.com/bozdaglab/TA-RNN’]

Fast and accurate modelling of longitudinal and repeated measures neuroimaging data.

Despite the growing importance of longitudinal data in neuroimaging, the standard analysis methods make restrictive or unrealistic assumptions (e.g., compound symmetry, i.e., all variances equal and all correlations equal, or spatially homogeneous longitudinal correlations). While newer methods have been proposed to account for such data more accurately, they rely on iterative algorithms that are slow and failure-prone. In this article, we propose the Sandwich Estimator method, which first estimates the parameters of interest with a simple Ordinary Least Squares (OLS) model and then estimates the variances/covariances with the so-called sandwich estimator (SwE), which accounts for the within-subject correlation present in longitudinal data. Here, we introduce the SwE method in its classic form, and we review and propose several adjustments to improve its behaviour, particularly in small samples. We use intensive Monte Carlo simulations to compare all considered adjustments and isolate the best combination for neuroimaging data. We also compare the SwE method with other popular methods and demonstrate its strengths and weaknesses. Finally, we analyse a highly unbalanced longitudinal dataset from the Alzheimer’s Disease Neuroimaging Initiative and demonstrate the flexibility of the SwE method in fitting within- and between-subject effects in a single model. Software implementing the SwE method is freely available at http://warwick.ac.uk/tenichols/SwE.

URL: [‘http://warwick.ac.uk/tenichols/SwE’]
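
As an illustration of the general approach (not of the SwE toolbox or its small-sample adjustments), the snippet below computes OLS point estimates and then the classic subject-block sandwich covariance on toy longitudinal data; the variable names and the simulated design are invented for the example.

```python
# Minimal numpy sketch of the classic (unadjusted) sandwich estimator for
# longitudinal data grouped by subject: OLS for the point estimates, then a
# "bread * meat * bread" covariance that accounts for within-subject correlation.
import numpy as np

def ols_with_sandwich(X, y, subject_ids):
    XtX_inv = np.linalg.inv(X.T @ X)            # "bread"
    beta = XtX_inv @ X.T @ y                    # ordinary least squares estimate
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for s in np.unique(subject_ids):            # accumulate per-subject blocks
        Xs, es = X[subject_ids == s], resid[subject_ids == s]
        meat += Xs.T @ np.outer(es, es) @ Xs
    cov_beta = XtX_inv @ meat @ XtX_inv         # sandwich covariance of beta
    return beta, cov_beta

# Toy example: 20 subjects, 3 visits each, intercept + time + group covariates
rng = np.random.default_rng(0)
subjects = np.repeat(np.arange(20), 3)
time = np.tile([0.0, 1.0, 2.0], 20)
group = np.repeat(rng.integers(0, 2, 20), 3)
X = np.column_stack([np.ones_like(time), time, group])
y = 1.0 + 0.5 * time + rng.normal(size=60) + np.repeat(rng.normal(size=20), 3)
beta, cov = ols_with_sandwich(X, y, subjects)
print(beta, np.sqrt(np.diag(cov)))              # estimates and their robust standard errors
```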

FONDUE: Robust resolution-invariant denoising of MR images using Nested UNets.

Recent human magnetic resonance imaging (MRI) studies continually push the boundaries of spatial resolution as a means to enhance neuroanatomical detail and increase the accuracy and sensitivity of derived brain morphometry measures. However, the acquisitions required to achieve these resolutions have a higher noise floor, potentially impacting segmentation and morphometric analysis results. This study proposes a novel, fast, robust, and resolution-invariant deep learning method to denoise structural human brain MRIs. We explore denoising of T1-weighted (T1w) brain images across field strengths (1.5T to 7T), voxel sizes (1.2 mm to 250 μm), scanner vendors (Siemens, GE, and Philips), and diseased and healthy participants over a wide age range (young adults to aging individuals). Our proposed Fast-Optimized Network for Denoising through residual Unified Ensembles (FONDUE) method demonstrated stable denoising across multiple resolutions, with performance on par with or superior to state-of-the-art methods while being several orders of magnitude faster at low relative cost when using a dedicated graphics processing unit (GPU). FONDUE achieved the best performance on at least one of the four denoising-performance metrics on every test dataset used, demonstrating its generalization capabilities and stability. Owing to its high-quality performance, robustness, fast execution times, and relatively low GPU memory requirements, as well as its open-source public availability, FONDUE can be widely used for structural MRI denoising, especially in large-cohort studies. We have made the FONDUE repository, all training and evaluation scripts, and the trained weights available at https://github.com/waadgo/FONDUE.

URL: [‘https://github.com/waadgo/FONDUE’]

IGUANe: A 3D generalizable CycleGAN for multicenter harmonization of brain MR images.

In MRI studies, the aggregation of imaging data from multiple acquisition sites enhances sample size but may introduce site-related variabilities that hinder consistency in subsequent analyses. Deep learning methods for image translation have emerged as a solution for harmonizing MR images across sites. In this study, we introduce IGUANe (Image Generation with Unified Adversarial Networks), an original 3D model that leverages the strengths of domain translation and straightforward application of style transfer methods for multicenter brain MR image harmonization. IGUANe extends CycleGAN by integrating an arbitrary number of domains for training through a many-to-one architecture. The framework based on domain pairs enables the implementation of sampling strategies that prevent confusion between site-related and biological variabilities. During inference, the model can be applied to any image, even from an unknown acquisition site, making it a universal generator for harmonization. Trained on a dataset comprising T1-weighted images from 11 different scanners, IGUANe was evaluated on data from unseen sites. The assessments included the transformation of MR images with traveling subjects, the preservation of pairwise distances between MR images within domains, the evolution of volumetric patterns related to age and Alzheimer’s disease (AD), and the performance in age regression and patient classification tasks. Comparisons with other harmonization and normalization methods suggest that IGUANe better preserves individual information in MR images and is more suitable for maintaining and reinforcing variabilities related to age and AD. Future studies may further assess IGUANe in other multicenter contexts, either using the same model or retraining it for applications to different image modalities. Codes and the trained IGUANe model are available at https://github.com/RocaVincent/iguane_harmonization.git.

URL: [‘https://github.com/RocaVincent/iguane_harmonization.git’]
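
The following simplified sketch illustrates what a many-to-one CycleGAN objective of the kind described above could look like: one universal generator mapping images from any site toward a reference domain, per-site reverse generators for cycle consistency, and a reference-domain discriminator. It uses tiny 2D networks and invented loss weights purely for illustration and does not reflect the actual IGUANe architecture or training code.

```python
# Conceptual sketch (simplified, 2D, and not the IGUANe code): a single
# "universal" generator G maps images from any acquisition site toward a
# reference domain, with a per-site reverse generator providing the
# CycleGAN cycle-consistency term. Network definitions are deliberately tiny.
import torch
import torch.nn as nn

def tiny_cnn(out_channels=1):
    return nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, out_channels, 3, padding=1))

n_sites = 3
G = tiny_cnn()                                          # any site -> reference domain
F_back = nn.ModuleList([tiny_cnn() for _ in range(n_sites)])  # reference -> site s
D_ref = nn.Sequential(tiny_cnn(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
l1, bce = nn.L1Loss(), nn.BCEWithLogitsLoss()

def generator_loss(x_site, site_idx, x_ref):
    fake_ref = G(x_site)                                # harmonized image
    adv = bce(D_ref(fake_ref), torch.ones(x_site.size(0), 1))     # fool D_ref
    cycle = l1(F_back[site_idx](fake_ref), x_site)      # site -> reference -> site
    identity = l1(G(x_ref), x_ref)                      # reference images left unchanged
    return adv + 10.0 * cycle + 5.0 * identity          # invented loss weights

x_site = torch.randn(2, 1, 64, 64)                      # toy batch from site 1
x_ref = torch.randn(2, 1, 64, 64)                       # toy batch from the reference domain
print(generator_loss(x_site, 1, x_ref).item())
```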

Computational refinement of post-translational modifications predicted from tandem mass spectrometry.

MOTIVATION: A post-translational modification (PTM) is a chemical modification of a protein that occurs naturally. Many of these modifications, such as phosphorylation, are known to play pivotal roles in the regulation of protein function, and PTM perturbations have consequently been linked to diverse diseases such as Parkinson’s, Alzheimer’s, diabetes and cancer. To discover PTMs on a genome-wide scale, there has been a recent surge of interest in analyzing tandem mass spectrometry data, and several unrestrictive (so-called ‘blind’) PTM search methods have been reported. However, these approaches are subject to noise in mass measurements and in the predicted modification site (amino acid position) within peptides, which can result in false PTM assignments. RESULTS: To address these issues, we devised a machine learning algorithm, PTMClust, that can be applied to the output of blind PTM search methods to improve prediction quality by suppressing noise in the data and clustering peptides with the same underlying modification into PTM groups. We show that our technique outperforms two standard clustering algorithms on a simulated dataset. Additionally, we show that our algorithm significantly improves sensitivity and specificity when applied to the output of three different blind PTM search engines: SIMS, InsPecT and MODmap. PTMClust also markedly outperforms another PTM refinement algorithm, PTMFinder. We demonstrate that our technique reduces false PTM assignments, improves overall detection coverage and facilitates novel PTM discovery, including terminus modifications. We applied our technique to a large-scale yeast MS/MS proteome profiling dataset and found numerous known and novel PTMs. Accurately identifying modifications in protein sequences is a critical first step for PTM profiling, and thus our approach may benefit routine proteomic analysis. AVAILABILITY: Our algorithm is implemented in Matlab and is freely available for academic use. The software is available online at http://genes.toronto.edu.

URL: [‘http://genes.toronto.edu’]
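
As a generic illustration of the grouping idea only (PTMClust itself is a dedicated machine learning model, not the clustering shown here), the snippet below clusters noisy mass shifts reported by a hypothetical blind PTM search so that measurements of the same underlying modification fall into one PTM group; the numbers and the DBSCAN parameters are invented for the example.

```python
# Generic illustration: group blind-search PTM assignments by their observed
# mass shift so that noisy measurements of the same underlying modification
# fall into one PTM group, with outliers flagged as noise.
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical mass shifts (Da) from a blind PTM search; the true underlying
# modifications here are ~+79.97 (phosphorylation) and ~+15.99 (oxidation).
mass_shifts = np.array([79.95, 79.97, 80.01, 79.99, 15.98, 16.02, 15.99, 42.3])
labels = DBSCAN(eps=0.05, min_samples=2).fit_predict(mass_shifts.reshape(-1, 1))

for group in sorted(set(labels)):
    members = mass_shifts[labels == group]
    tag = "noise" if group == -1 else f"PTM group {group}"
    print(f"{tag}: mean shift {members.mean():.3f} Da, n={len(members)}")
```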

DRAW+SneakPeek: analysis workflow and quality metric management for DNA-seq experiments.

SUMMARY: We report our new DRAW+SneakPeek software for DNA-seq analysis. The DNA resequencing analysis workflow (DRAW) automates the processing of raw sequence reads, including quality control, read alignment and variant calling, on high-performance computing facilities such as the Amazon Elastic Compute Cloud. SneakPeek provides an effective interface for reviewing the dozens of quality metrics reported by DRAW, so users can assess the quality of their data and diagnose problems in their sequencing procedures. Both DRAW and SneakPeek are freely available under the MIT license and are provided as Amazon machine images that can be used directly on the Amazon cloud with minimal installation. AVAILABILITY: DRAW+SneakPeek is released under the MIT license and is free for academic and nonprofit use. Information about the source code, the Amazon machine images and instructions for installing and running DRAW+SneakPeek locally and on the Amazon Elastic Compute Cloud is available at the National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (http://www.niagads.org/) and the Wang lab website (http://wanglab.pcbi.upenn.edu/).

URL: [‘http://www.niagads.org/’, ‘http://wanglab.pcbi.upenn.edu/’]

nBEST: Deep-learning-based non-human primates Brain Extraction and Segmentation Toolbox across ages, sites and species.

Accurate processing and analysis of non-human primate (NHP) brain magnetic resonance imaging (MRI) plays an indispensable role in understanding brain evolution, development, aging, and disease. Despite the accumulation of diverse NHP brain MRI datasets at various developmental stages and from various imaging sites/scanners, existing computational tools designed for human MRI typically perform poorly on NHP data, due to large differences in brain size, morphology, and imaging appearance across species, sites, and ages, highlighting the need for NHP-specialized MRI processing tools. To address this issue, we present a robust, generic, and fully automated computational pipeline, the non-human primates Brain Extraction and Segmentation Toolbox (nBEST), whose main functionality includes brain extraction, non-cerebrum removal, and tissue segmentation. Built on cutting-edge deep learning techniques, employing lifelong learning to flexibly integrate data from diverse NHP populations and a newly designed 3D U-NeXt architecture, nBEST handles structural NHP brain MR images across species, sites, and developmental stages (from neonates to the elderly). We extensively validated nBEST on, to our knowledge, the largest assembled dataset in NHP brain studies, encompassing 1,469 scans of 11 species (e.g., rhesus macaques, cynomolgus macaques, chimpanzees, marmosets, squirrel monkeys) from 23 independent datasets. nBEST outperforms alternative tools in precision, applicability, robustness, comprehensiveness, and generalizability, greatly benefiting downstream longitudinal, cross-sectional, and cross-species quantitative analyses. We have released nBEST as an open-source toolbox (https://github.com/TaoZhong11/nBEST) and are committed to its continual refinement, through lifelong learning with incoming data, to contribute to the research field.

URL: [‘https://github.com/TaoZhong11/nBEST’]

rPOP: Robust PET-only processing of community acquired heterogeneous amyloid-PET data.

The reference standard for amyloid-PET quantification requires structural MRI (sMRI) for preprocessing in both multi-site research studies and clinical trials. Here we describe rPOP (robust PET-Only Processing), a MATLAB-based, MRI-free pipeline implementing non-linear warping and differential smoothing of amyloid-PET scans performed with any of the FDA-approved radiotracers (18F-florbetapir/FBP, 18F-florbetaben/FBB or 18F-flutemetamol/FLUTE). Each image undergoes spatial normalization based on weighted PET templates and data-driven differential smoothing, after which users can perform their quantification of choice. Prior to normalization, users can choose to automatically reset the image origin to the center of mass or to proceed with the image as is. We validated rPOP with n = 740 amyloid-PET scans (514 FBP, 182 FBB, 44 FLUTE) from the Imaging Dementia-Evidence for Amyloid Scanning - Brain Health Registry sub-study (IDEAS-BHR) and n = 1,518 scans from the Alzheimer’s Disease Neuroimaging Initiative (ADNI; n = 1,249 FBP, n = 269 FBB), including heterogeneous acquisition and reconstruction protocols. After running rPOP, a standard quantification was performed to extract standardized uptake value ratios (SUVRs) and convert them to Centiloids. rPOP-based amyloid status (using an independent pathology-based threshold of >=24.4 Centiloid units) was compared with either local visual reads (IDEAS-BHR, n = 663 with complete valid data and reads available) or amyloid status derived from an MRI-based PET processing pipeline (ADNI, thresholds of >20/>18 Centiloids for FBP/FBB). Finally, within the ADNI dataset, we tested the linear association between rPOP- and MRI-based Centiloid values. rPOP achieved accurate warping for 2,233/2,258 scans (98.9%) on the first pass. Of the 25 warping failures, 24 were rescued by manual reorientation and origin reset prior to warping. We observed high concordance between rPOP-based amyloid status and both visual reads (IDEAS-BHR, Cohen’s k = 0.72 [0.70-0.74], ~86% concordance) and MRI-pipeline-based amyloid status (ADNI, k = 0.88 [0.87-0.89], ~94% concordance). rPOP- and MRI-pipeline-based Centiloids were strongly linearly related (R2 = 0.95, p < 0.001), with this association significantly modulated by the estimated PET resolution (beta = -0.016, p < 0.001). rPOP provides reliable MRI-free amyloid-PET warping and quantification, leveraging widely available software and requiring only an attenuation-corrected amyloid-PET image as input. The rPOP pipeline enables the comparison and merging of heterogeneous datasets and is publicly available at https://github.com/leoiacca/rPOP.

URL: [‘https://github.com/leoiacca/rPOP’]
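
To illustrate the downstream quantification step that follows rPOP (which is performed by the user, not by rPOP itself), the sketch below converts a tracer-specific SUVR to Centiloids with a linear equation and applies the >=24.4 Centiloid positivity threshold mentioned above. The slope/intercept values are placeholders only; real Centiloid conversions are tracer- and pipeline-specific and must be calibrated separately.

```python
# Minimal sketch of a user-side quantification step: linear SUVR-to-Centiloid
# conversion followed by the >=24.4 Centiloid positivity threshold reported in
# the abstract. The slope/intercept values below are PLACEHOLDERS, not
# published conversion equations.
PLACEHOLDER_CL_EQUATIONS = {
    # tracer: (slope, intercept) for Centiloid = slope * SUVR + intercept
    "FBP":   (180.0, -180.0),
    "FBB":   (155.0, -150.0),
    "FLUTE": (120.0, -120.0),
}

def amyloid_status(suvr: float, tracer: str, threshold: float = 24.4):
    slope, intercept = PLACEHOLDER_CL_EQUATIONS[tracer]
    centiloid = slope * suvr + intercept
    return centiloid, centiloid >= threshold

print(amyloid_status(1.25, "FBP"))  # (45.0, True) with the placeholder equation
```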

StemMapper: a curated gene expression database for stem cell lineage analysis.

Transcriptomic data have become a fundamental resource for stem cell (SC) biologists as well as for a wider research audience studying SC-related processes such as aging, embryonic development and prevalent diseases including cancer, diabetes and neurodegenerative diseases. Accessing and analyzing the growing amount of freely available transcriptomics datasets for SCs, however, are not trivial tasks. Here, we present StemMapper, a manually curated gene expression database and comprehensive resource for SC research, built on integrated data for different lineages of human and mouse SCs. It is based on careful selection, standardized processing and stringent quality control of relevant transcriptomics datasets to minimize artefacts, and currently includes over 960 transcriptomes covering a broad range of SC types. Each of the integrated datasets was individually inspected and manually curated. StemMapper’s user-friendly interface enables fast querying, comparison, and interactive visualization of quality-controlled SC gene expression data in a comprehensive manner. A proof-of-principle analysis discovering novel putative astrocyte/neural SC lineage markers exemplifies the utility of the integrated data resource. We believe that StemMapper can open the way for new insights and advances in SC research by greatly simplifying the access to and analysis of SC transcriptomic data. StemMapper is freely accessible at http://stemmapper.sysbiolab.eu.

URL: [‘http://stemmapper.sysbiolab.eu’]

MassCube improves accuracy for metabolomics data processing from raw files to phenotype classifiers.

Nontargeted peak detection in LC-MS-based metabolomics must become robust and benchmarked. We present MassCube, a Python-based open-source framework for MS data processing that we systematically benchmark against other algorithms and different types of input data. From raw data, peaks are detected by constructing mass traces through signal clustering and Gaussian-filter-assisted edge detection. Peaks are then grouped for adduct and in-source fragment detection, and compounds are annotated by both identity and fuzzy searches. Final data tables undergo quality control and can be used for metabolome-informed phenotype prediction. Peak detection in MassCube achieves 100% signal coverage with comprehensive reporting of chromatographic metadata for quality assurance. MassCube outperforms MS-DIAL, MZmine 3 and XCMS in speed, isomer detection, and accuracy. It supports diverse numerical routines for MS data analysis while maintaining efficiency, and is capable of handling 105 GB of Astral MS data on a laptop within 64 min, whereas other programs took 8-24 times longer. MassCube automatically detected age, sex and regional differences when applied to the Metabolome Atlas of the Aging Mouse Brain data despite batch effects. MassCube is available at https://github.com/huaxuyu/masscube for direct use or implementation into larger applications in omics or biomedical research.

URL: [‘https://github.com/huaxuyu/masscube’]
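
As a generic illustration of Gaussian-filter-assisted peak and edge detection on a single mass trace (not the MassCube API or its actual algorithm), the snippet below smooths a synthetic extracted ion chromatogram and walks outward from each detected apex to estimate the peak edges; the thresholds and synthetic data are invented for the example.

```python
# Generic illustration: smooth an extracted ion chromatogram with a Gaussian
# filter, locate peak apexes, and walk outward from each apex to find the peak
# edges where the smoothed signal drops below a small fraction of the apex.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

rt = np.linspace(0, 60, 600)                       # retention time, seconds
trace = (1e5 * np.exp(-0.5 * ((rt - 30) / 2.0) ** 2)
         + np.random.default_rng(0).normal(0, 2e3, rt.size))

smoothed = gaussian_filter1d(trace, sigma=3)       # suppress high-frequency noise
apexes, _ = find_peaks(smoothed, prominence=1e4)
for apex in apexes:
    cutoff = 0.05 * smoothed[apex]                 # edge threshold: 5% of apex intensity
    left = apex
    while left > 0 and smoothed[left - 1] > cutoff:
        left -= 1
    right = apex
    while right < smoothed.size - 1 and smoothed[right + 1] > cutoff:
        right += 1
    print(f"peak apex at {rt[apex]:.1f}s, edges {rt[left]:.1f}-{rt[right]:.1f}s")
```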