Skip to main content

Enhancing landslide susceptibility mapping using a positive-unlabeled machine learning approach: a case study in Chamoli, India

Abstract

Introduction

The Indian Himalayas' susceptibility to landslides, particularly as a location where climate change effects may be event catalysts, necessitates the development of dependable landslide susceptibility maps (LSM).

Method

This study diverges from traditional binary classification models, framing LSM as a positive-unlabeled learning problem. This approach acknowledges that regions without recorded landslides are not necessarily at low risk but could simply have not experienced landslides yet. The study utilizes novel positive-unlabeled learning-enhanced algorithms—Random Forest, K-Nearest Neighbor, and Decision Tree—to create LSM for Chamoli district, India. Eleven causative factors for landslides are identified, including elevation, aspect, slope, geology, geomorphology, distance to lineament, lithology, NDVI, distance to river, distance to road and residential land use. To address spatial correlation biases, instead of randomly splitting the dataset, the study adopts spatial splitting to get the training and testing datasets.

Conclusion

The study reveals that positive-unlabeled learning substantially improves the Area Under Curve and recall, leading to a more conservative LSM compared to binary classification methods. Analysis shows that the southern region of Chamoli exhibits high recall but lower accuracy, suggesting a latent high landslide susceptibility despite a lack of historical landslides in this region. The study also quantifies the impact of human activity on landslide risk, indicating an elevated threat to life and the local economy, especially in Chamoli's southwestern areas.

Introduction

Annually, landslides in India result in a monetary loss of approximately US$ 20–30 M (Parkash 2011), impacting a minimum of 15% of India's total land area (Sangeeta and Maheshwari 2019). The Himalayan region within India is especially susceptible to landslides due to its steep slopes, inclement weather patterns, unregulated human settlements, and geodynamic activity (Chauhan et al. 2010). With the emergence of climate change as an added stressor, landslide occurrence in this area is likely to increase further. Creating a dependable LSM that offers regional insights regarding landslide susceptibility to both the public and government entities may help alleviate this hazard and minimize the damage resulting from future climate change catalyzed landslides.

Several studies using various methods have been adopted to generate an LSM for the Himalayan region of India (Ghosh and Bhattacharya 2010; Sarkar et al. 2008; Vijith et al. 2009). Among the explored methods, machine learning is known for capturing the nonlinear relationship between the landslide causative factors and landslides, and generating cell-calibrated output (Das et al. 2023). Chauhan et al. (2010) used a logistic regression model to create a landslide susceptibility zonation (LSZ) map of 600 km2 area in the Chamoli region in the Garhwal Himalayas. The study found that 71.13% of observed landslides occur in the predicted very high and high susceptibility zones, covering 21.96% of the LSZ. Saha et al. (2020) evaluated a set of individual and ensemble machine learning methods to generate an LSM for the Rudraprayag District of Uttarakhand, India. The study considered 16 different landslide conditioning factors, and the dataset was randomly split into training and testing sets. Their study found that the Artificial Neural Network—Random Forest—Logistic Regression (ANN-RF-LR) model performs the best, with an 88% area under the success rate curve and a 94% area under the prediction rate curve. Saha et al. (2021) employed an ensemble of conditional probability and boosted regression tree-based method to develop a landslide susceptibility map for Rudraprayag, India. Their findings indicated that fallow land and roadsides with high elevation and steep slopes exhibit a significantly higher susceptibility to landslide occurrences. Juyal and Sharma (2021) explored the feasibility of utilizing machine learning techniques to develop a landslide susceptibility map in Uttarakhand, India. They highlighted that the prevalent methods for LSM generation include Support Vector Machines (SVM), Random Forest (RF), and Artificial Neural Networks (ANN). Furthermore, they suggested that an ensemble of machine learning algorithms can offer a superior performance. Kainthura and Sharma (2022) conducted a comparative study to assess the efficacy of three different machine learning techniques, namely RF, Neural Network, and Bayesian Network, for developing an LSM of the Uttarkashi region, India. The study employed eleven landslide factors and randomly divided the dataset into training and testing sets. The results indicated that RF yields the most favorable outcome, exhibiting a testing accuracy of 86%. Solanki et al. (2022) used machine learning algorithms to prepare landslide susceptibility maps for the Kali River Valley in the Kumaun Himalaya region. The study considered 15 landslide conditioning factors and used 368 landslides for analysis. The results indicated that slope, elevation, and distance to thrust are the most significant factors. The study concluded that the ensemble XGB and RF algorithms outperform KNN, with approximately 85% accuracy compared to KNN's 81%. In a recent investigation by Badola et al. (2023), the landslide susceptibility of the river Alakananda in Chamoli, Uttarakhand, India, was analyzed using the XGBoost method and seven landslide factors. The results of the study demonstrated a high susceptibility along the road, with an AUC value of 91% being achieved. Table 1 shows the comparison of landslide conditioning factors and predictive model performance among the literatures above.

Table 1 Overview of landslide conditioning factors and predictive model performances in using machine learning to generate the LSM for the Himalayan region of India

Traditional machine learning techniques applied to create LSM typically frame the challenge as a binary classification task, distinguishing samples into either 'landslide' (positive) or 'non-landslide' (negative) categories. Such a binary distinction might be an oversimplification, as the 'non-landslide' category not only encompasses truly landslide-free areas (negative) but also regions with latent landslide potential (positive). Consequently, LSM generation should not be treated as a conventional binary classification machine learning problem (Fang et al. 2021). Wu et al. (2020) first used Positive Unlabeled (PU)-bagging decision tree in generating LSM. for landslide-prone areas, considering that only positive and unlabeled data are provided in the landslide susceptibility modeling field. PU bagging refers to creating a training set by combining all positive data points with a random sample from the unlabeled points, with replacement. The results demonstrated that this method outperforms other established techniques, such as logistic regression, support vector machines, and artificial neural networks, in producing accurate LSM. Fang et al. (2021) applied a PU learning approach in combination with adaptive sampling and RF to predict landslide susceptibility in the Three Gorges Reservoir area of China, which achieved desirable prediction outcomes. Yao et al. (2022) classified LSM as a P.U. learning problem and introduced a two-step deep learning (T-DNN) framework based on the Spy technique. This approach enhances the selection of negative samples and improves the accuracy and reliability of landslide susceptibility assessment.

This study pioneers the application of a PU learning methodology to craft a high-fidelity LSM for the Himalayan region of India. Unlike traditional machine learning methods, the LSM created through PU learning exhibits a more conservative approach, which is crucial for the Himalayan area, where landslides pose a continual risk to communities. The LSM produced in this research is able to identify regions with a high landslide susceptibility even if they have never experienced landslides, providing valuable insights for local authorities. This research not only underscores the importance of implementing PU learning for LSM development but also establishes a pioneering model for future geohazard modeling, especially in areas facing similar challenges with dataset characteristics.

Study area

This study focuses on the Chamoli district, as shown in Fig. 1, the second largest district in the Uttarakhand state of India, with a total geographical area of approximately 7820 km2. Characterized by its predominantly mountainous terrain, Chamoli exhibits an agrarian economy, with forest cover constituting the primary land use at 58.38%. The district experiences a diverse climate, oscillating between subtropical monsoon and tropical upland types, and maintains a relative humidity fluctuating between 35 and 70% annually (CGWB 2022). The landscape of the Chamoli region is characterized by elevated hills and mountains, interspersed with narrow valleys and deep gorges, with certain areas enduring perennial snow cover.

Fig. 1
figure 1

Chamoli district, Uttarakhand, India, 30.2937° N, 79.5603° E (Esri India 2021)

The district hosts the vital arterial road connecting Rishikesh to Badrinath. This revered Hindu pilgrimage circuit encompasses four seminal temples nestled in the Garhwal Himalayas: Kedarnath, Badrinath, Gangotri, and Yamunotri. Given the influx of pilgrims attracted to this route, it is crucial to ensure its safety.

In this terrain, landslides emerge as a recurrent and formidable hazard, imperiling the pilgrims and the intricate infrastructure of Chamoli's road network (Roushan 2023). The Chamoli Rock-Ice Avalanche in 2021 demonstrated the severity of such hazards, claiming 78 lives and leaving 204 individuals unaccounted for (Vangla et al. 2022). In addition, it likely heralds a new causative factor of climate change and its associated factors (elevated temperatures, glacial melting, and flooding amongst others) that has not been explicitly relevant in regions of past studies. Consequently, the development of accurate LSM’s for the Chamoli and similar Himalayan foothill districts has escalated in priority and is poised to play a pivotal role in pinpointing vulnerable zones and devising effective mitigation strategies.

Data source and materials

This study examines eleven causative factors for the landslide, including elevation, aspect, slope, geology, geomorphology, distance to lineament, lithology, NDVI, distance to river, distance to road and residential land use, as shown in Fig. 2. These factors are selected to provide comprehensive geological, hydrological, land cover, and morphological insights, marking an initial effort to forecast events in regions affected by climate change stressors (Lima et al. 2022; Reichenbach et al. 2018).

Fig. 2
figure 2figure 2

Landslide causative factors. a elevation; b aspect; c slope; d geology; e geomorphology; f distance to lineament; g lithology; h NDVI; i distance to river; j distance to road; k residential land use; l Ground truth landslide map

The elevation determines the distribution of the surface and directly provides the gravitational energy that initializes a landslide (Das et al. 2023; Youssef et al. 2016). Aspect is tied with the insolation, rainfall, and land use pattern, which will indirectly impact a landslide (Das et al. 2023). The slope is the critical factor, considering that there is a positive correlation between landslide and slope angle (Das et al. 2023; Felicísimo et al. 2013). Geology significantly influences landslide risk due to slope-forming material, soil type, weathering, and erosion conditions (Das et al. 2023). Geomorphology represents the comprehensive collection of knowledge regarding the terrain and, therefore, plays a role in assessing the risk of landslides (Das et al. 2023). Lineaments are mappable features on a surface, often indicative of underlying structures such as fractures, faults, and shear zones. Lineaments can indicate areas of weakness, which contribute to higher landslide risk in their vicinity (Chauhan et al. 2010; O'leary et al. 1976). The resistance of lithology units to landslides can vary due to differences in their structure and compactness (Chauhan et al. 2010). NDVI shows the growth conditions and the concentration of vegetation. A low NDVI, indicative of sparse vegetation, will increase landslide susceptibility due to reduced water retention and weakened soil stability (Pradhan and Lee 2010). The distance to a river is an important factor to consider when assessing landslide susceptibility since rivers can disrupt the equilibrium of forces within a slope, alter the weight of materials, and cause erosion (Gupta and Shukla 2022). The intensity of human activities can be reflected by land use and distance to roads. Urbanization can potentially increase the susceptibility of landslides in mountainous regions like the Chamoli district, highlighting the importance of these factors (Singh et al. 2021).

In this study, the elevation data is obtained from the Shuttle Radar Topography Mission (SRTM) 1 Arc-Second Global dataset provided by USGS, which offers a 30-m resolution (USGS 2018). Aspect and slope are derived using ArcGIS Pro based on the elevation data. NDVI is derived from Sentinel-2 data with a resolution of 10 m (EOB 2023). The remaining data related to causative factors are acquired from "Bhukosh", a portal for geoscientific data of the Geological Survey of India, and other relevant Government of India websites. These datasets were obtained as a shapefile and accessed in July 2022 (IMD 2023). All data is re-projected, resampled to 50 m resolution, and rasterized using ArcGIS Pro. Therefore, the resolution of the LSM produced in this study is 50 m by 50 m. Table 2 further elucidates the features of each landslide causative factor stored in the dataset. Future studies are expected to introduce additional climate change stressor factors to further enhance the predictions based on other similar events such as Sikkim Glacial Lake Out-Flow (GLOF) (Sattar 2023).

Table 2 The landslide causative factors in the dataset

This work utilizes historical landslide data to identify positive landslide points. The data is sourced from the Geological Survey of India (Geosadak 2023) and news outlets through crowdsourcing. Every landslide point or polygon is buffered to form a polygon with a diameter of 500 m. A 500-m buffer is selected to ensure a conservative manner in generating LSM, capturing a broader area that might be influenced by historical landslides. Furthermore, the 500-m buffer is used consistently throughout the study to avoid introducing any other variables in the analysis. This uniform buffer is particularly applied when generating categorical features such as Distance to lineament, Distance to road, and Distance to river, as shown in Table 2.

Methodology

PU learning

The PU learning algorithm employed in this study originates from Learning Classifiers from Only Positive and Unlabeled Data (Elkan and Noto 2008). This algorithm assumes that the selection of positive labeled data points is performed randomly from the entire pool of positive data points. This assumption serves as the basis for the derivation of the underlying lemma:

$$P\left( {y = 1{|}x} \right) = \frac{P(s = 1|x)}{{P(s = 1|y = 1)}}$$
(1)

In Eq. 1, x represents individual data points and y denotes the binary label [0,1]. The variable s serves as the indicator for whether a data point is labeled. If the data point is labeled, s = 1; otherwise, s = 0. It is important to note that s will only be assigned a value of 1 when y is 1.

\(P\left(y=1|x\right)\) is calculated based on the lemma, and the corresponding algorithms are shown in Fig. 3.

Fig. 3
figure 3

PU learning algorithm

In essence, this algorithm and its foundational lemma enable the computation of the probability of a data point being positively labeled, leveraging the relationships between the labeled data points and the overall dataset.

In this study, data points that correspond to areas where a landslide has previously occurred are classified as positive points, whereas data points that do not have a history of landslides are classified as unlabeled points.

Spatial correlation: train test split

Numerous studies have employed a random dataset split into training and testing sets for LSM generation using machine learning techniques (Kainthura and Sharma 2022; Saha et al. 2020; Solanki et al. 2022). However, such an approach fails to account for the spatial correlation within the spatial data, leading to an overly optimistic model performance (Meyer et al. 2019). To this end, this study divides the dataset into four equally sized sections along the latitude, denoted as X1, X2, X3, and X4, as shown in Fig. 4, with each section serving as a testing set in turn while the model is trained on the remaining sections. Considering PU learning's focus on positive points as the source of known knowledge, the positive point ratio, defined as the percentage of positive points amongst the four sections, is calculated and presented in Table 3. Although the dataset splitting method ensures a balanced number of data points in each dataset, an imbalance is still observed in the positive point ratio, with X1 containing only 13.55% of positive points, whereas X3 has 39.55%. This issue is further discussed in the following section.

Fig. 4
figure 4

Spatial train test split

Table 3 Positive point ratio in each set

Performance metrics

Three performance metrics are used to evaluate the model performance in this work, namely recall, AUC, and accuracy.

Recall, also called as true positive rate, measures the proportion of actual positive instances that were correctly identified by a classifier. Recall is a particularly important metric in PU learning considering the nature of the data, since only positive instances are labeled.

AUC is generated from ROC curve which is used to assess the performance of binary classifiers by plotting true positive rate against the false positive rate. False positive rate is the proportion of actual negative instances that were incorrectly identified as positive. AUC measures the total area underneath the ROC curve. An AUC of 1.0 denotes a perfect classifier, while an AUC of 0.5 indicates a classifier performing no better than random chance.

Accuracy is one of the most common metrics to evaluate classification models. It represents the fraction of predictions that are correct out of all the predictions made. In PU learning, where data consists of positive and unlabeled instances, the unlabeled points are often treated as negative for accuracy calculations. This approach allows for the swift establishment of a baseline model.

Results

In this study, five experiments are conducted using different train and test datasets, namely random splitting, using X1 to test and X2, X3, X4 to train, using X2 to test and X1, X3, X4 to train, using X3 to test and X1, X2, X4 to train, using X4 to test and X1, X2, X3 to train. For each experiment, RF, KNN, and decision tree models are trained with and without the PU learning algorithm. The recall metric measures the proportion of actual positive points that are correctly predicted, which is considered the most important metric for this dataset, as discussed in "Performance metrics" section. The model performance is presented in Table 4, with the bold value indicating the highest performance of each metric in every experiment. Notably, using the PU learning algorithm improves the recall and AUC performance in every experiment, with a particularly pronounced improvement in the recall value. Among the models evaluated, the RF model with the PU learning algorithm reaches an average recall of 56.67%, AUC of 69.37%, and accuracy of 79.28% when splitting the dataset spatially. This model consistently surpasses others in recall, all the while sustaining high accuracy levels.

Table 4 Performance of machine learning models with and without P.U. learning algorithms

Figure 5 depicts the performance of RF with and without PU learning in each experiment. It is evident from Fig. 5 that the accuracy remains relatively stable in the first four sets of columns, which represent random split, X1, X2, and X3 as test sets, respectively. However, there is a notable improvement in recall with the inclusion of PU learning, indicating the enhanced ability of models with PU Learning to correctly identify positive instances.

Fig. 5
figure 5

Accuracy and recall of RF with and without P.U. learning algorithm

A different trend is observed in the X4 data set, where applying P.U. learning results in a drop in accuracy, despite an increase in recall. This decrease in accuracy in the X4 set can be attributed to the presence of a higher number of points prone to landslides, even though landslides have not occurred at those points. The model, with its heightened sensitivity to positive instances through PU learning, predicts more of these unlabeled points as positive. This results in a decrease in accuracy, as these unlabeled points are treated as negatives in the accuracy calculation.

Figure 6a–d illustrates the visualization of the landslide susceptibility degree when utilizing RF with PU learning on X1, X2, X3, and X4 as test sets, respectively. Figure 6e combines the test results into a singular map and Fig. 6f overlays Fig. 6e together with ground truth landslide map. The probability output by RF with PU learning is categorized into five classes, namely very low susceptible (VLS), low susceptible (LS), medium susceptible (MS), high susceptible (HS), and very high susceptible (VHS) based on Jenk's natural break. As observed in Fig. 6e, f, the X4 region, especially the western area, has a higher concentration of VHS areas, but less historical landslide points, which coincides with the model's performance depicted in Fig. 5.

Fig. 6
figure 6

Visualization of the LSM of Chamoli. a X1; b X2; c X3; d X4; e LSM of Chamoli; f LSM of Chamoli and ground truth landslide map

Discussion

Balanced positive points distribution

As outlined in "Spatial correlation: train test split" section, there is an imbalanced distribution of positive points across X1, X2, X3, and X4. Since the model relies heavily on positive points, this imbalance can negatively impact the model's performance. To address this issue, the range of X1, X2, X3, and X4 is adjusted based on latitude to ensure that each set has a roughly equal number of positive points. Table 5 shows the corresponding positive point ratios after this adjustment. Compared to the original split in Table 3, the positive point ratio in X1 has significantly increased, while that in X3 has decreased following the adjustment.

Table 5 Positive point ratio of revised train test split

After obtaining the new data split, four experiments were conducted using revised train and test datasets. These experiments involved using revised X1 to test and revised X2, X3, X4 to train, using revised X2 to test and revised X1, X3, X4 to train, using revised X3 to test and revised X1, X2, X4 to train, and using revised X4 to test and revised X1, X2, X3 to train. As demonstrated in "Results" section, RF generally performs better than KNN and decision tree models. Therefore, for each experiment, an RF model was trained both with and without the PU learning algorithm. Table 6 displays the performance of the models using the revised dataset, with the bold value highlighting the highest performance of each metric in every experiment. Consistent with the results in "Results" section, using the PU learning algorithm enhances the recall and AUC performance in all experiments, particularly the recall value. Moreover, the revised X4 exhibits a significant decrease in accuracy after using the PU learning algorithm, which also aligns with the findings in "Results" section.

Table 6 The performance of RF model with and without PU learning algorithm on revised datasets

Figure 7 compares accuracy and recall for the RF with the PU learning algorithm between the original and revised datasets. The accuracy remains roughly consistent with and without adjustment across all experiments. However, there is a decrease in recall for X1 and an increase in recall for X3, while the recall for X2 and X4 remains relatively unchanged. This shift in recall aligns with the fact that the number of positive points in X1 increases while the number of positive points in X3 decreases after the adjustment. Overall, the performance of the revised model demonstrates that the results in "Results" section were not significantly influenced by the imbalanced distribution of positive points in each set.

Fig. 7
figure 7

The comparison of accuracy and recall on the original dataset and the revised dataset. a Accuracy; b Recall

Human activities and HS and VHS area

Figure 8 shows Chamoli LSM overlaying with road map and residential land use map. From Fig. 6e and Fig. 8, it is evident that the VHS area aligns with the road map in Chamoli. Also, the Southwestern region of Chamoli exhibits heightened susceptibility to landslides, which aligns with the residential land use map. It is noteworthy that human activities may lead to greater susceptibility to landslides, while urbanization is also vulnerable to such incidents. To this end, two indices, namely Human Activity-Landslide Susceptibility Alignment (HA-LSA) Index and Landslide Threat Coverage (LTC) Index, are derived for road and residential land use, respectively. The equations below show the definitions of HA-LSA Index and LTC Index. It should be noted that human activities could refer to road use or residential land use. The HA-LSA Index shows the consistency between human activities and HS, VHS area, shedding light on how human activities may exacerbate landslide susceptibility, as shown in Eq. 2. The LTC Index shows the percentage of human activities under the threat of landslides, as shown in Eq. 3. Table 7 shows the results for road and residential land use.

$$HA - LSA\; index = \frac{The\;overlap\;between\;human\;activities\;and\;HS,\;VHS\;area}{{HS,\;VHS\;area}}$$
(2)
$$LTC\;index = \frac{The\;overlap\;between\;human\;activities\;and\;HS,\;VHS\;area}{{human\;activities\;area}}$$
(3)
Fig. 8
figure 8

LSM of Chamoli with road and residential land use map

Table 7 HA-LSA Index and LTC Index for roads and residential land use

The indices derived from road and residential land use data indicate a notable overlap between human activities and HS and VHS areas, with values ranging from 50 to 60% for HA-LSA Index. This observation underscores the potential of human activities, such as the construction of roads and residential areas, to increase the susceptibility of the region to landslides. However, the more concerning finding is that LTC Index reveals that over 50% of the roads and over 30% of the residential areas are currently under the threat of landslides. Considering the seriousness of this issue, it's crucial for local government officials to consider immediate action.

P > 1 in PU learning case

In this work, landslide susceptibility is classified into five degrees according to the probability derived from the RF model with the PU learning algorithm. However, for certain data points, particularly those in the HS and VHS areas, the probability may exceed 1. This is due to the assumption of the PU learning algorithm that the positive labeled data points are selected randomly from the entire pool of positive data points. If the assumption does not hold true, \(P(s=1|x)\) could be larger than \(P(s=1|y=1)\), leading to a \(P\left(y=1|x\right)\) being greater than 1. However, it is hard for a real-world dataset to perfectly fit into this assumption. In the Chamoli district, there is a higher chance for the HS and VHS areas to have landslides (positive points) than the VLS, LS, and MS areas. Therefore, there are points with a probability larger than one in the HS, and VHS areas. Despite this limitation, the conclusion drawn from the model remains valid since the probability range still captures the different degrees of landslide susceptibility. In future studies, adjusting the PU learning algorithm to fit the landslide scenario better to obtain more accurate probability estimates may be beneficial.

Conclusions

This study employed eleven causative factors to generate LSM of Chamoli district, Uttarakhand, India. These factors include elevation, aspect, slope, geology, geomorphology, distance to lineament, lithology, NDVI, distance to river, distance to road and residential land use. The goal of the current study is to improve the base methodology for predicting landslides in susceptible regions where climate change stressors are expected to add additional causative factors. Subsequent enhancements to incorporate new causative factor resulting from climate change and its associated factors (elevated temperatures, glacial melting, and flooding amongst others) are anticipated. Based on this initial analysis, the following eight conclusions were drawn:

  • The use of machine learning to generate a LSM is more accurately classified as a PU learning problem rather than a binary classification problem due to the nature of landslide data points. The absence of historical landslide events does not necessarily indicate a lack of susceptibility to landslides. Hence, it should be considered as unlabeled instead of being labeled as negative.

  • When generating LSM with machine learning methods, it is preferable to partition the dataset into training and testing sets based on spatial criteria rather than random selection due to the inherent spatial correlation in geospatial datasets.

  • It is noteworthy that in the context of PU learning, recall is a particularly significant metric. Compared with the original machine learning model, using PU learning can significantly increase the recall, indicating that the model can better identify the features of areas susceptible to landslides.

  • The RF model with the PU learning algorithm achieves the highest performance for the Chamoli landslide dataset, demonstrating an average recall of 56.67%, AUC of 69.37%, and accuracy of 79.28% when the dataset is spatially split.

  • The model achieves a higher recall but a much lower accuracy when using PU learning for X4 as the test set. This suggests that the southern part of Chamoli (X4) may be highly susceptible to landslides, including some areas where landslides have not occurred before, indicating a need for increased attention from local authorities.

  • The imbalance of positive points in the Chamoli landslide dataset does not significantly impact the performance of the model, indicating the robustness of the approach.

  • Roads and residential land use align with the HS and VHS area distributions, which could indicate that human activities cause a higher landslide susceptibility. Also, a large percentage of roads and residential land use are within HS and VHS areas, potentially threatening life and the local economy.

  • The landslide dataset does not perfectly fit into the PU learning assumption, which leads to some points having a probability larger than 1. However, this does not affect the quality of the LSM’s, and the algorithm can be further improved to better reflect the true probability of landslides occurring in an area.

  • Ongoing studies are focused on introducing additional climate change stressors as factors in the PU learning method to further enhance the prediction method and will be presented in future publications.

Availability of data and materials

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

References

  • Badola S, Mishra VN, Parkash S (2023) Landslide susceptibility mapping using XGBoost machine learning method. In: 2023 international conference on machine intelligence for geoanalytics and remote sensing (MIGARS)

  • CGWB (2022) Chamoli district at a glance. Retrieved June 20, 2023 from http://cgwb.gov.in/sites/default/files/2022-10/chamoli.pdf

  • Chauhan S, Sharma M, Arora MK (2010) Landslide susceptibility zonation of the Chamoli region, Garhwal Himalayas, using logistic regression model. Landslides 7:411–423

    Article  Google Scholar 

  • Das S, Sarkar S, Kanungo DP (2023) A critical review on landslide susceptibility zonation: recent trends, techniques, and practices in Indian Himalaya. Nat Hazards 115(1):23–72

    Article  Google Scholar 

  • Elkan C, Noto K (2008) Learning classifiers from only positive and unlabeled data. Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

  • EOB (2023) EO Browser. Sinergise Ltd. Retrieved June 20, 2022 from https://apps.sentinel-hub.com/eo-browser/

  • Esri India (2021) Uttarakhand: Education. ArcGIS Hub. Retrieved June 20, 2022 from https://hub.arcgis.com/datasets/b2f4e9476e9c4e8ab9e9e76e740054e1_0/about

  • Fang Z, Wang Y, Niu R, Peng L (2021) Landslide susceptibility prediction based on positive unlabeled learning coupled with adaptive sampling. IEEE J Sel Top Appl Earth Obs Remote Sens 14:11581–11592

    Article  Google Scholar 

  • Felicísimo ÁM, Cuartero A, Remondo J, Quirós E (2013) Mapping landslide susceptibility with logistic regression, multiple adaptive regression splines, classification and regression trees, and maximum entropy methods: a comparative study. Landslides 10:175–189

    Article  Google Scholar 

  • Geosadak (2023) Geosakad PMGSY National GIS. Retrieved June 20, 2022 from https://geosadak-pmgsy.nic.in/opendata/

  • Ghosh JK, Bhattacharya D (2010) Knowledge-based landslide susceptibility zonation system. J Comput Civ Eng 24(4):325–334

    Article  Google Scholar 

  • Gupta SK, Shukla DP (2022) Effect of scale and mapping unit on landslide susceptibility mapping of Mandakini River Basin, Uttarakhand, India. Environ Earth Sci 81(14):373

    Article  Google Scholar 

  • IMD (2023) India Meteorological Department, Ministry of Earth Sciences, Government of India - Home page. Retrieved June 20, 2022 from https://mausam.imd.gov.in/

  • Juyal A, Sharma S (2021) A study of landslide susceptibility mapping using machine learning approach. In: 2021 third international conference on intelligent communication technologies and virtual mobile networks (ICICV)

  • Kainthura P, Sharma N (2022) Machine learning driven landslide susceptibility prediction for the Uttarkashi region of Uttarakhand in India. Georisk Assess Manag Risk Eng Syst Geohazards 16(3):570–583

    Article  Google Scholar 

  • Lima P, Steger S, Glade T, Murillo-García FG (2022) Literature review and bibliometric analysis on data-driven assessment of landslide susceptibility. J Mt Sci 19(6):1670–1698. https://doi.org/10.1007/s11629-021-7254-9

    Article  Google Scholar 

  • Meyer H, Reudenbach C, Wöllauer S, Nauss T (2019) Importance of spatial predictor variable selection in machine learning applications–moving from data reproduction to spatial prediction. Ecol Model 411:108815

    Article  Google Scholar 

  • O’leary D, Friedman J, Pohn H (1976) Lineament, linear, lineation: some proposed new standards for old terms. Geol Soc Am Bull 87(10):1463–1469

    Article  Google Scholar 

  • Parkash S (2011) Historical records of socio-economically significant landslides in India. J South Asia Disaster Stud 4(2):177–204

    Google Scholar 

  • Pradhan B, Lee S (2010) Landslide susceptibility assessment and factor effect analysis: backpropagation artificial neural networks and their comparison with frequency ratio and bivariate logistic regression modelling. Environ Model Softw 25(6):747–759

    Article  Google Scholar 

  • Reichenbach P, Rossi M, Malamud BD, Mihir M, Guzzetti F (2018) A review of statistically-based landslide susceptibility models. Earth Sci Rev 180:60–91. https://doi.org/10.1016/j.earscirev.2018.03.001

    Article  Google Scholar 

  • Roushan A (2023) Uttarakhand: Rishikesh-Badrinath highway reopened for traffic after 17-hour closure due to landslide. Retrieved August from https://www.indiatvnews.com/news/india/rishikesh-badrinath-highway-reopened-for-traffic-closure-due-to-landslide-uttarakhand-rains-debris-on-road-2023-08-08-885574

  • Saha S, Saha A, Hembram TK, Pradhan B, Alamri AM (2020) Evaluating the performance of individual and novel ensemble of machine learning and statistical models for landslide susceptibility assessment at Rudraprayag District of Garhwal Himalaya. Appl Sci 10(11):3772

    Article  CAS  Google Scholar 

  • Saha S, Arabameri A, Saha A, Blaschke T, Ngo PTT, Nhu VH, Band SS (2021) Prediction of landslide susceptibility in Rudraprayag, India using novel ensemble of conditional probability and boosted regression tree-based on cross-validation method. Sci Total Environ 764:142928

    Article  CAS  Google Scholar 

  • Sangeeta, Maheshwari B (2019) Earthquake-induced landslide hazard assessment of chamoli district, uttarakhand using relative frequency ratio method. Indian Geotech J 49:108–123

    Article  Google Scholar 

  • Sarkar S, Kanungo DP, Patra A, Kumar P (2008) GIS based spatial data analysis for landslide susceptibility mapping. J Mt Sci 5:52–62

    Article  Google Scholar 

  • Sattar A (2023) Sikkim glacial lake outburst spotlights climate vulnerability of the Himalayas. Nature India

  • Singh A, Pal S, Kanungo D (2021) An integrated approach for landslide susceptibility–vulnerability–risk assessment of building infrastructures in hilly regions of India. Environ Dev Sustain 23(4):5058–5095

    Article  Google Scholar 

  • Solanki A, Gupta V, Joshi M (2022) Application of machine learning algorithms in landslide susceptibility mapping, Kali valley, Kumaun Himalaya, India. Geocarto Int 37(27):16846–16871

    Article  Google Scholar 

  • USGS (2018) USGS EROS Archive - digital elevation - shuttle radar topography mission (SRTM) 1 Arc-Second Global. Retrieved June 20, 2022

  • Vangla P, Jindal D, Sajwan A, Devrani R (2022) Site visit and drone based reconnaissance study of the severely affected infrastructure by February 7th, 2021, Chamoli rock-ice avalanche disaster, Uttarakhand, India. In: Report: geotechnical extreme events reconnaissance. GEER) Association. https://doi.org/10.18118/G6V95B

  • Vijith H, Rejith P, Madhu G (2009) Using InfoVal method and GIS techniques for the spatial modelling of landslide susceptibility in the upper catchment of river Meenachil in Kerala. J Indian Soc Remote Sens 37:241–250

    Article  Google Scholar 

  • Wu B, Qiu W, Jia J, Liu N (2020) Landslide susceptibility modeling using bagging-based positive-unlabeled learning. IEEE Geosci Remote Sens Lett 18(5):766–770

    Article  Google Scholar 

  • Yao J, Qin S, Qiao S, Liu X, Zhang L, Chen J (2022) Application of a two-step sampling strategy based on deep neural network for landslide susceptibility mapping. Bull Eng Geol Env 81(4):148

    Article  Google Scholar 

  • Youssef AM, Pourghasemi HR, Pourtaghi ZS, Al-Katheeri MM (2016) Landslide susceptibility mapping using random forest, boosted regression tree, classification and regression tree, and general linear models and comparison of their performance at Wadi Tayyah Basin, Asir Region, Saudi Arabia. Landslides 13:839–856

    Article  Google Scholar 

Download references

Acknowledgements

The primary data used in this study was sourced from Bhukosh and Geosadak. The efforts of the database developers are gratefully acknowledged.

Funding

The research described in this manuscript is an outcome of a collaborative project supported by the Scheme for Promotion of Academic and Research Collaboration (SPARC) Project #P83, Ministry of Education (MoE), India. Additional support was provided by the US National Science Foundation through Grant CMMI 1826118. Additional support was provided by the Elizabeth and Bill Higginbotham Professorship at Georgia Tech (GTF 200000340) and is gratefully acknowledged. The opinions are those of the authors and not the sponsors.

Author information

Authors and Affiliations

Authors

Contributions

DF and PV established the research framework and revised the paper. DZ and DJ collected and washed the data. DZ ran the models. DZ, DJ, NR visualized and analyzed the results from the model.

Corresponding author

Correspondence to Danrong Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhang, D., Jindal, D., Roy, N. et al. Enhancing landslide susceptibility mapping using a positive-unlabeled machine learning approach: a case study in Chamoli, India. Geoenviron Disasters 11, 21 (2024). https://doi.org/10.1186/s40677-024-00281-w

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s40677-024-00281-w

Keywords