
Comparison of factors influencing landslide risk near a forest road in Chungju-si, South Korea



The study aimed to identify the influential factors required to prepare landslide vulnerability maps and establish disaster prevention measures for mountainous areas with forest roads. The target area is Sancheok-myeon, Chungju-si, where several landslides have occurred in a narrow area of approximately 3 km × 4 km. As the area has the same rainfall and vegetation conditions, the influences of the physico-mechanical characteristics of the soil in accordance with compaction and topographic characteristics could be analyzed precisely.


Landslide risk in the study area was assessed through geological surveying, sampling, and laboratory testing, which yielded data on unit weight, specific gravity, porosity, water content, soil depth, friction angle, cohesion, slope angle, profile and plan curvature, and the topographic wetness index (TWI). The data were preprocessed and screened using min–max normalization and multicollinearity diagnosis to avoid overestimating the effect of each factor. The influence of each factor was analyzed using logistic regression (LR), structural equation modeling (SEM), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM).


All methods showed that soil depth has the greatest impact on landslide occurrence. Friction angle, slope angle, and porosity were also selected as influential factors, although each method ranked them slightly differently. Topographic factors, such as plan curvature, profile curvature, and the topographic wetness index, had minimal influence. This appears to be because landslides near forest roads are more affected by how well compaction was performed during banking than by the concave or convex shape of the slope. This study presents analysis results for an area with the same rainfall and vegetation conditions; therefore, the analysis of the influence of the physico-mechanical characteristics of the soil and topography was more precise than when comparing landslides occurring in different regions. Our results may be helpful in preparing landslide vulnerability maps.


Landslides constitute one of the most dangerous types of natural disaster. They are known to have caused 838 deaths annually worldwide between 2002 and 2021 (CRED 2023). Precise mapping of landslide susceptibility and methods for assessing landslide risk to reduce potential damage have received substantial research attention. However, predicting landslide occurrence remains difficult despite sustained research efforts, because it is affected by complex interactions among many factors, including geological conditions, geomorphology, climate, earthquakes, and vegetation (Gerrard and Gardner 2002; Wobus et al. 2003; Hasegawa et al. 2009). Apart from rainfall, the main factors influencing landslide occurrence and the relationships among them remain unclear, which hinders precise landslide prediction (John and Douglas 2012).

Regarding analysis methods, statistical techniques including conditional probability, weight of evidence, frequency ratio (FR), and logistic regression (LR) were typically used in the 1990s and 2000s to analyze the influence of causative factors and to predict landslides. Machine learning methods, such as artificial neural networks and deep learning, have been used since the 2010s. For example, Eker and Aydin (2014) prepared a landslide vulnerability map for different road types (e.g., forest roads and expressways) by conducting geographic-information-system-based LR analysis of land use, petrology, elevation, slope, side, distance to rivers, distance to roads, and plan curvature. Pham et al. (2016) assessed landslide vulnerability in 930 landslide areas by analyzing Google images using support vector machine, LR, Fisher’s linear discriminant analysis, Bayesian network, and naïve Bayes techniques. Wang et al. (2016) proposed a landslide prediction model using LR, FR, decision tree, weight of evidence, and artificial neural networks. Chen et al. (2018) proposed a landslide vulnerability model using a random forest (RF) algorithm based on a digital elevation model and Landsat-8 data. Xiao et al. (2020) proposed a landslide vulnerability model using hybrid models combining RF, FR, CF (certainty factor), and the index of entropy (IOE), namely RF-FR, RF-CF, RF-IOE, IOE-CF, and CF-FR. Further methods, such as big data, machine learning, and deep learning methods, which may overcome existing mathematical and engineering limitations, have been actively used in recent years. Representative prediction models include boosting-based models, such as extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and category boost (CatBoost) (Chen and Guestrin 2016; Ke et al. 2017; Prokhorenkova et al. 2017).

Regarding influential factors, data-driven studies have identified precipitation or rainfall intensity as the most influential factor, because such studies must analyze numerous landslide cases collected from widely separated locations where rainfall conditions differ (Chae et al. 2004; Quan et al. 2011; Chen et al. 2013). In physically based analyses, the friction angle and cohesion, which appear in the slope stability equation, have been evaluated as the dominant factors (Regmi et al. 2010; Qu et al. 2021). Slopes with complete vegetation cover have been reported to be more stable than slopes with meagre vegetation (Schmidt et al. 2001; Osman and Barakbah 2006), and the hydrological and geomorphological effects of forest roads on the land surface and landforms have attracted substantial interest (Luce and Wemple 2000; Dutton et al. 2005). Vanacker et al. (2005) reported that changes in the forest landscape or large-scale logging alter the soil infiltration and evapotranspiration rates, which indirectly affects the soil water content and reduces slope stability. Soils from landslide-prone areas have been reported to be mainly silty soils with low plasticity (Jotisankasa and Vathananukij 2008). Nugraha et al. (2015) argued that land surface (geomorphometric) characteristics have a significant relationship with landslide distribution, and others have emphasized the importance of investigating topographic influence (Fernandes et al. 2004; Broothaerts et al. 2012). In addition, Owen (1981) reported that sunny aspects were much more susceptible to landslides than shady aspects: when the soils are saturated, the liquid limit of the sunny-aspect subsoil is exceeded, whereas that of the shady-aspect subsoil is not. Meanwhile, Kimaro et al. (2000) suggested that the most important soil characteristic is the presence of saprolite or a boundary with hard bedrock. As these studies show, the factors judged most influential differ with the researcher’s perspective, because landslides are affected by many factors, including vegetation, climate, geology, and topography.

Throughout Korea’s many mountainous areas, several forest roads have been constructed for forest management. Construction of these roads involves the formation of cutting and banking slopes, which affect slope stability by changing the ground, topography, and water flow (Wemple et al. 1996; Choi et al. 2011). In addition, a recent increase in sudden localized downpours (so-called guerrilla rainstorms) caused by climate change has increased landslide risk. According to the Korea Forest Service, the frequency of landslides is increasing annually, and the size of a landslide depends on the region and season, with typhoons and heavy rainfall being concentrated in summer (Korea Forest Service 2021). Accordingly, previous regional landslide analyses have focused primarily on rainfall, which is an external factor; as a result, detailed risk planning that considers internal factors has not been available when selecting areas for reinforcement and establishing disaster prevention measures. The present study analyzes the effects of soil, topographic factors, and rainfall on landslides using statistical and machine learning methods to identify major influencing factors. The results may aid in preparing landslide vulnerability maps and establishing disaster prevention measures (e.g., prioritizing areas for reinforcement) within budgetary constraints.


The theory of logistic regression analysis

Logistic regression (LR) analysis determines correlations between a dependent variable and multiple independent variables influencing it. The probability of an event can be calculated and expressed as a value between 0 and 1. Values can be binarized by those ≥ 0.5 being assigned as 1, and those < 0.5 being assigned as 0. The probability of an event through LR analysis (\({P}_{Z}\)) is given by Eqs. (1) and (2):

$${P}_{Z}=\frac{1}{1+{e}^{-Z}}$$
$$Z=\alpha +{\beta }_{1}{X}_{1}+\cdots +{\beta }_{n}{X}_{n}$$

where \(Z\) is the linear combination of the independent variables (the logit), \(\alpha\) is a constant, and \({\beta }_{n}\) is the regression coefficient of the independent variable \({X}_{n}\).
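As a minimal sketch of the calculation above, the following Python snippet evaluates the linear predictor Z and converts it to a probability, assuming the standard logistic link \(P_Z = 1/(1+e^{-Z})\). The function names, coefficients, and factor values are hypothetical and chosen only for illustration; they are not the fitted values of this study.

```python
import math


def landslide_probability(intercept, coefficients, values):
    """Return P_Z = 1 / (1 + exp(-Z)) with Z = intercept + sum(beta_i * x_i)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Binarize a probability: 1 (occurrence) if >= threshold, else 0."""
    return 1 if probability >= threshold else 0


# Hypothetical coefficients for two normalized factors (illustration only).
p = landslide_probability(-1.0, [2.0, -0.5], [0.8, 0.4])
label = classify(p)
```

Here Z = -1.0 + 1.6 - 0.2 = 0.4, so the probability lies slightly above 0.5 and the site would be labeled as an occurrence.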

The Nagelkerke R-squared, Hosmer and Lemeshow, and confusion matrix verification methods were used to analyze the reliability of the results of LR analysis. Nagelkerke R-squared indicates the degree to which independent variables can explain the dependent variable. A value of ≥ 20% indicates that the independent variables have explanatory power. The Hosmer and Lemeshow test determines the overall goodness-of-fit of a regression model. A significance probability of > 0.05 indicates that the model fits the data adequately. Prediction accuracy is assessed from the confusion matrix and from the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, which has an X-axis of “1 − specificity” and a Y-axis of “sensitivity”. Values of AUC are distributed between 0 and 1, with a value close to 1 indicating an accurate model (Fawcett 2006; Godt et al. 2008; Šimundić 2009). Accuracy, specificity, and sensitivity can be calculated using Eqs. (3), (4), and (5), respectively:

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
$$\mathrm{Specificity}=\frac{TN}{TN+FP}$$
$$\mathrm{Sensitivity}=\frac{TP}{TP+FN}$$


where a true positive (TP) is the correct prediction of a positive value, a true negative (TN) is a correct prediction of a negative value, a false positive (FP) is a negative value incorrectly predicted as positive, and a false negative (FN) is a positive value incorrectly predicted as negative.
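The three measures can be illustrated with a short Python sketch that counts TP, TN, FP, and FN from paired labels and applies the standard definitions (accuracy = (TP + TN) / all cases, specificity = TN / (TN + FP), sensitivity = TP / (TP + FN)). The label lists are hypothetical and are not data from this study.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = occurrence, 0 = non-occurrence)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, specificity, sensitivity) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    return accuracy, specificity, sensitivity


# Hypothetical labels for eight sites (illustration only).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
metrics = classification_metrics(*counts)
```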

The theory of structural equation model

The SEM used here was first suggested by Wright (1921). It is a path-analysis-based statistical method used to identify causal relationships among multiple variables with complex interrelationships and in cases with many independent and dependent variables. Although it seems similar to multiple regression analysis, it is a more detailed model because it can consider the mutual influences of all variables and can easily identify interrelationships among variables using graphical representation (Hox and Bechger 1999; Yung 2008; Ullman and Bentler 2013). The SEM can be subdivided into a measurement model for confirmatory factor analysis and a theoretical model for multiple regression and path analyses. The former is applied when each variable can explain latent variables perfectly, and the latter is used to group variables into representative latent variables and to find the most descriptive model. When studying landslides, the latter model is more suitable than the former, as many factors related to slope failure or landslide occurrence are difficult to explain fully, and there are limitations related to ground heterogeneity.

Theoretical modeling in an SEM is performed using partial least squares (PLS), which is subdivided into partial least squares regression (PLS-R) and partial least squares path models (PLS-PMs). PLS-R is used when there are more variables than observations, and PLS-PMs are applied to analyze interrelationships. This study applies a PLS-PM, which can deal with a large amount of data and analyze the interrelationships and causal relationships among influential factors related to landslide occurrence. PLS-PM analysis is a multivariate analysis technique that analyzes the systems of relationships among many blocks containing variables. This approach follows the component-based estimation procedure, and it is defined by the following two basic concepts: each block of variables acts as a latent variable, and it is assumed that there is a system of linear relationships between blocks. The analysis considers multiple relationships between variable blocks, and each variable block is assumed to be represented by a latent variable or theoretical concept. Here, each latent variable is a hypothetical variable created to generate an SEM. The latent variables are grouped with variables that have similar characteristics. The determination and interrelationships of the latent variables are set by the researcher’s subjective judgment, and continuous modification and supplementation are required until an optimal model is developed.

Figure 1 outlines the PLS-PM, showing the constituent manifest (dependent and independent variables, Xij) and latent variables. The first step in PLS-PM analysis is to set the latent variables—which comprise manifest variables with similar characteristics (Fig. 1a)—and then the internal model is determined based on the latent variables of the external model (Fig. 1b). The setup of the external and internal models is then modified and updated until statistically significant results are obtained for the arrangement of variables, causal relationships, and error terms. The last step assesses the confidence of the entire model using the external and internal models estimated from the above two steps (Fig. 1c). Here, weight and loading (α and β, respectively; Fig. 1) are essentially correlation coefficients.

Fig. 1

Schematic diagram of the PLS-PM. a The external model comprises manifest variables (dependent and independent variables, Xij) and latent variables. b The internal model consists of latent variables (LVi). c The complete PLS-PM includes both the internal and external models

The theory of XGBoost

XGBoost applies the classification and regression tree (CART) model to the gradient boosting algorithm and supports parallel processing, which allows it to solve a variety of data mining problems (Chen and Guestrin 2016; DSBA 2020; Yoon 2020; An 2021). Unlike other tree-based learning methods, XGBoost learns using Eq. (6), which is based on the CART model. When the data comprise input variable x and output variable y, \(\widehat{y}\) is the predicted value for data x, K is the number of CARTs, and f is the CART model (Chen and Guestrin 2016). Equation (7) gives the objective function for training the CART model. Here, \(l\left({y}_{i},\widehat{{y}_{i}}\right)\) is the loss function measuring the difference between the actual and predicted values, and \(\Omega\) is the regularization term of the model, which prevents overfitting. The objective function at step t can be expressed using XGBoost’s additive method and a second-order Taylor expansion, as shown in Eq. (8). In the Taylor expansion, \({g}_{i}\) is the first-order partial derivative of the loss with respect to \({\widehat{{y}_{i}}}^{\left(t-1\right)}\) and is defined as \({g}_{i}={\partial }_{{\widehat{y}}^{\left(t-1\right)}}l\left({y}_{i},{\widehat{y}}^{\left(t-1\right)}\right)\); \({h}_{i}\) is the second-order partial derivative of the loss with respect to \({\widehat{{y}_{i}}}^{\left(t-1\right)}\) and is defined as \({h}_{i}={\partial }_{{\widehat{y}}^{\left(t-1\right)}}^{2}l\left({y}_{i},{\widehat{y}}^{\left(t-1\right)}\right)\). Greedy and approximate algorithms are used to optimize the prediction model and identify the optimal split point using the above equations (Chen and Guestrin 2016).

$$\widehat{{y}_{i}}={\sum }_{k=1}^{K}{f}_{k}\left({x}_{i}\right)$$
$$obj\left(\theta \right)={\sum }_{i=1}^{n}l\left({y}_{i},\widehat{{y}_{i}}\right)+{\sum }_{k=1}^{K}\Omega \left({f}_{k}\right)$$
$$obj\left(t\right)={\sum }_{i=1}^{n}\left[{g}_{i}{f}_{t}\left({x}_{i}\right)+\frac{1}{2}{h}_{i}{f}_{t}^{2}\left({x}_{i}\right)\right]+\Omega \left({f}_{t}\right)$$
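To make the roles of \(g_i\) and \(h_i\) concrete, the sketch below evaluates them for the binary log-loss, which is an assumption made here for illustration (the loss is chosen by the user in practice). For this loss, with p the sigmoid of the previous raw score, the derivatives are g = p − y and h = p(1 − p). The function names are hypothetical, and the regularization term is passed in as a plain number rather than computed from a tree.

```python
import math


def sigmoid(score):
    """Map a raw additive score to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


def logloss_grad_hess(y_true, raw_score):
    """First and second derivatives of the log-loss w.r.t. the raw score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)


def second_order_objective(labels, prev_scores, tree_outputs, regularization=0.0):
    """Sum of g_i * f_t(x_i) + 0.5 * h_i * f_t(x_i)^2 plus a regularization value."""
    total = 0.0
    for y, score, f in zip(labels, prev_scores, tree_outputs):
        g, h = logloss_grad_hess(y, score)
        total += g * f + 0.5 * h * f * f
    return total + regularization


# One positive sample whose previous raw score is 0 (probability 0.5).
g, h = logloss_grad_hess(1, 0.0)
objective = second_order_objective([1], [0.0], [1.0])
```

A negative objective value, as obtained here, indicates that adding the new tree output moves the prediction in the direction that reduces the loss.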

The theory of LightGBM

LightGBM is a GBM- and tree-based algorithm that performs learning using residuals. However, unlike the level-wise (symmetric) growth of conventional trees, its tree structure is asymmetrical because it grows trees leaf-wise. LightGBM uses a feature histogram that divides continuous variables into discrete sections (bins) during learning. The method learns a function from the input space \({X}^{S}\) to the gradient space \(G\) using decision trees. Given a training set of n independent and identically distributed instances \(\left\{{x}_{1},{x}_{2},\cdots ,{x}_{n}\right\}\), each \({x}_{i}\) is a vector of dimension s in \({X}^{S}\). In each GBM iteration, the negative gradients of the loss function with respect to the model output are denoted \(\left\{{g}_{1},{g}_{2},\cdots ,{g}_{n}\right\}\). The model splits each node on the variable with the largest information gain, which is measured using Eq. (9) (Ke et al. 2017):

$${V}_{\left.j\right|O}\left(d\right)=\frac{1}{{n}_{O}}\left(\frac{{\left({\sum }_{{x}_{i}\in O:{x}_{ij}\le d}{g}_{i}\right)}^{2}}{{n}_{\left.l\right|O}^{j}\left(d\right)}+\frac{{\left({\sum }_{{x}_{i}\in O:{x}_{ij}>d}{g}_{i}\right)}^{2}}{{n}_{\left.r\right|O}^{j}\left(d\right)}\right),$$
$${n}_{O}=\sum I\left[{x}_{i}\in O\right],\quad {n}_{\left.l\right|O}^{j}\left(d\right)=\sum I\left[{x}_{i}\in O:{x}_{ij}\le d\right],\quad {n}_{\left.r\right|O}^{j}\left(d\right)=\sum I\left[{x}_{i}\in O:{x}_{ij}>d\right]$$

where O is the training data in the tree node, d is the split point, j is the variable on which the split is made at point d, and \(I\left[\cdot \right]\) is the indicator function. However, evaluating this gain for every candidate split point is inefficient. To reduce the cost, gradient-based one-side sampling (a method of reducing the number of data instances) and exclusive feature bundling (a method of reducing the number of variables) are used.
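The split criterion can be sketched in Python as follows, following the variance gain of Ke et al. (2017), in which the left child collects the instances with \(x_{ij} \le d\) and the right child those with \(x_{ij} > d\). This is an exhaustive search over one feature for illustration only; it does not implement histogram binning, one-side sampling, or feature bundling, and the function names and toy values are hypothetical.

```python
def variance_gain(feature_values, gradients, split_point):
    """Variance gain of splitting one feature at split_point within a node."""
    left = [g for x, g in zip(feature_values, gradients) if x <= split_point]
    right = [g for x, g in zip(feature_values, gradients) if x > split_point]
    if not left or not right:
        return 0.0  # a split that leaves one side empty gains nothing
    n_node = len(feature_values)
    return (sum(left) ** 2 / len(left) + sum(right) ** 2 / len(right)) / n_node


def best_split(feature_values, gradients):
    """Return the candidate split point with the largest variance gain."""
    candidates = sorted(set(feature_values))[:-1]  # the largest value cannot split
    if not candidates:
        return None
    return max(candidates, key=lambda d: variance_gain(feature_values, gradients, d))


# Toy node: small feature values carry negative gradients, large ones positive.
x = [0.1, 0.2, 0.8, 0.9]
g = [-0.5, -0.4, 0.4, 0.5]
split = best_split(x, g)
gain = variance_gain(x, g, split)
```

In this toy node the chosen split separates the negative from the positive gradients, which is where the squared gradient sums of the two children are largest.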

Study area and data collection

Status of landslide occurrence and sampling locations

To analyze the factors influencing landslide occurrence, site investigation and sampling were conducted in Sancheok-myeon, Chungju-si, South Korea, where a number of landslides occurred within a small area (3 km by 4 km) following 259 mm of rainfall on 2 August 2020. The study area lies between 37° 06′ 02.6″ N and 37° 08′ 26.5″ N and between 127° 57′ 50.3″ E and 127° 59′ 13.1″ E. Its terrain is mountainous, being surrounded by Ocheong mountain (EL + 656.8 m) to the northeast, Cheondeung mountain (EL + 807.1 m) to the east, and Jangbaek mountain (EL + 405.0 m) to the west. Mesozoic granite is reported to be the main rock type in the study area, and this granite contains some Precambrian gneiss (Kim 2022).

According to the precipitation record of the Korea Meteorological Administration (KMA) from 2000 to 2023 (KMA 2023), the average annual precipitation of the study area is 1,196 mm, which is similar to that of South Korea as a whole (about 1,200 mm). Rainfall-induced disasters are therefore rare in this region. However, in 2020, when many landslides occurred, the annual precipitation (1,500 mm), the daily precipitation (316 mm), and the maximum hourly precipitation (76.5 mm) were all record-breaking.

Figure 2 shows sampling locations and photographs of the landslides that occurred in the study area. Sampling locations are colored either red or yellow: the 40 red points are locations at which landslides occurred, and the 45 yellow points indicate sampling locations where no landslides occurred. One sample was taken at each location, with or without a landslide (where a landslide had occurred, the sample was taken from the head part), to provide the statistical and machine learning analyses with the necessary data for both landslide and non-landslide locations. Most landslides in the study area occurred near the forest road (Fig. 2) because the slope angle steepens at the cut slope, and the soil thickness increases where the construction of the road involved slope filling. The slopes above the forest road comprised weathered soil derived from biotite granite, and the slopes below the road were made up of embanked soil excavated during road construction. Therefore, the soil at all landslide sites was weathered granite soil, consisting mostly of sand classified as SW-SP (well-graded to poorly graded sand). Consequently, the mineralogical composition of the soil particles was consistent across the sampling locations, as the roads had been constructed at the same time. Because vegetation and rainfall conditions are also consistent throughout the study area, it allows specific analysis of the influence of topography and of the physico-mechanical characteristics associated with soil compaction on landslide occurrence.

Fig. 2

Locations of sampling points and photographs of landslides along the forest road. Landslides have occurred at 40 of the 85 sampling locations

The dataset comprises the following information for each site: presence or absence of landslide occurrence (hereafter abbreviated as occurrence or non-occurrence); thickness of the soil layer (hereafter abbreviated as soil depth); slope angle; plan curvature and profile curvature; TWI; dry and saturated unit weights of the soil; and porosity, specific gravity, saturated water content, friction angle, and cohesion of the soil. Sampling was conducted in the head parts of areas where landslides occurred. Elevation was not considered, as the sampling points were at similar altitudes. Data for landslide occurrence and soil depth were obtained from site investigations and dynamic cone penetration testing, and physico-mechanical properties (unit weight, specific gravity, porosity, friction angle, and cohesion) were measured according to the test standards of the American Society for Testing and Materials (ASTM D2216-10; ASTM D2487-17; ASTM D3080-98; ASTM D422-63; ASTM D854-10). Topographic characteristics (slope angle and profile and plan curvatures) were derived from 1:50,000 digital topographic maps of the National Geographic Information Institute using SAGA GIS software. The profile and plan curvatures describe whether the slope is concave (negative value) or convex (positive value) longitudinally and in cross-section, respectively (Fig. 3). The TWI is an indicator of soil wetness and is calculated using Eq. (10) (Beven and Kirkby 1979):

$$\mathrm{TWI}=\ln\left(\frac{\mathrm{SCA}}{\tan\theta }\right)$$

where SCA denotes the local upslope area draining through a certain point per unit contour length, and θ is the local slope in radians. The SCA is calculated using multiple flow directions, as the flow may vary according to the slope’s direction and gradient. Most of the factors are continuous data; the only categorical factor is occurrence (1 for occurrence and 0 for non-occurrence).
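As a minimal sketch, the TWI definition above (the natural logarithm of SCA divided by the tangent of the local slope) can be evaluated for a single cell as follows. The function name and the input values are hypothetical; in practice SCA and slope come from a GIS flow-routing step that is not reproduced here.

```python
import math


def topographic_wetness_index(sca, slope_radians):
    """TWI = ln(SCA / tan(slope)), with the local slope given in radians."""
    tangent = math.tan(slope_radians)
    if sca <= 0 or tangent <= 0:
        raise ValueError("SCA and slope must be positive for TWI to be defined")
    return math.log(sca / tangent)


# Hypothetical cell: SCA of 100 m on a 45-degree slope (tan = 1).
twi_steep = topographic_wetness_index(100.0, math.radians(45.0))
# Same SCA on a gentler 10-degree slope gives a larger (wetter) index.
twi_gentle = topographic_wetness_index(100.0, math.radians(10.0))
```

For a fixed SCA, gentler slopes give larger TWI values, which is consistent with water accumulating more readily on flatter ground.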

Fig. 3

Schematic diagrams of a profile and b plan curvature (modified after Dikau 1989)

Distribution of data

The values recorded for the various factors that control landslide occurrence are shown as box-and-whisker plots in Fig. 4. The ends of whiskers indicate the maximum and minimum statistically significant values; any values beyond these ranges were discounted as erroneous outliers. Boxes span the first and third quartiles (the interquartile range); therefore, each box encloses half of the data. The horizontal bar in each box indicates the median, and the plots depict the distribution of data, allowing comparison of the range, interquartile range, and median.

Fig. 4

Box plots showing the values of properties influencing landslides

The unit weight shows greater whisker and interquartile ranges for non-occurrence cases than for occurrence cases, regardless of the conditions being dry or saturated (Fig. 4a, b). The ranges of specific gravity are similar for occurrence and non-occurrence cases (Fig. 4c). The interquartile range and median of porosity are higher for occurrence cases than non-occurrence cases; porosity is expected to be proportional to occurrence, as soil can hold much water, increasing its weight and reducing the resistance force (Fig. 4d). The median saturated water content is slightly higher for occurrence cases than for non-occurrence cases (Fig. 4e) and is interpreted similarly to the results of porosity. The interquartile range and median of soil depth are higher for occurrence cases than non-occurrence cases (Fig. 4f). Those of friction angle tend to be lower for occurrence cases than non-occurrence cases, whereas cohesion has the opposite tendency, being positively correlated with occurrence (Fig. 4g, h). In terms of mechanics, cohesion is generally proportional to non-occurrence. However, if the friction angle and cohesion were measured from direct shear tests, they would be inversely proportional to each other according to the Mohr–Coulomb failure criterion (Moon et al. 2020). For this reason, the friction angle is inversely proportional to occurrence, and cohesion is proportional to occurrence. The interquartile range of the slope angle is higher for occurrence than non-occurrence (Fig. 4i). The median profile and plan curvatures are inversely proportional to occurrence, meaning that a number of landslides occurred near valleys with concave topography (Fig. 4j, k). The interquartile range and median of TWI are higher for occurrence than non-occurrence, indicating that soil containing water was prone to landslides (Fig. 4l).

Data preprocessing and screening

Data preprocessing and screening for statistical analysis were performed using min–max normalization and multicollinearity diagnosis. Min–max normalization was performed for the 12 measured independent variables (dry unit weight (kN/m³), saturated unit weight (kN/m³), specific gravity, porosity (%), saturated water content (%), friction angle (°), cohesion (kPa), soil depth (m), slope angle (°), profile curvature, plan curvature, and TWI) (Eq. (11)):

$${X}_{n}=\frac{X-{X}_{min}}{{X}_{max}-{X}_{min}}$$


where \({X}_{n}\), \(X\), \({X}_{min}\), and \({X}_{max}\) are the normalized, observed, minimum observed, and maximum observed values, respectively. The normalized results are all between 0 and 1, which facilitates direct comparison of the effects of each independent variable (despite their initially different distributions and units) on the dependent variable (i.e., landslide occurrence).
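A minimal Python sketch of min–max normalization, which maps each observed value X to (X − X_min) / (X_max − X_min), is given below. The function name and the example soil depths are hypothetical and are not measurements from this study.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("normalization is undefined when all values are equal")
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical soil depths in metres (illustration only).
normalized_depths = min_max_normalize([10, 20, 30])
```

Applying the same rescaling to every factor puts variables with different units, such as kPa and degrees, on a common 0 to 1 scale.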

Multicollinearity arises when highly correlated independent variables are used in regression analysis, and it can cause negative effects such as the overestimation of regression model variables and degraded reliability of the regression results (Ryu 2008). Therefore, collinearity must be assessed before regression analysis. The variance inflation factor (VIF; Eq. (12)) can be used for this. A VIF of ≥ 10 indicates multicollinearity (Kutner et al. 2004).

$$VIF=\frac{1}{1-{R}^{2}}$$


where \({R}^{2}\) is the coefficient of determination obtained by regressing one independent variable on the remaining independent variables. The left side of Table 1 lists the estimated multicollinearity among the 12 independent variables. Those with VIF values of ≥ 10, and thus high correlations corresponding to multicollinearity, are the dry unit weight (\({\gamma }_{d}\)), saturated unit weight (\({\gamma }_{sat}\)), specific gravity (\({G}_{s}\)), porosity (\(e\)), and saturated water content (\(w\)). These factors can be related using Eqs. (13) to (15):

$${\gamma }_{d}=\frac{{G}_{s}}{1+e}$$
$${\gamma }_{sat}=\left(1+w\right){\gamma }_{d}$$
$${G}_{s}\times w=S\times e$$

where \(S\) is the degree of saturation.

Table 1 Variance inflation factors (VIFs) for properties influencing landslide susceptibility to assess multicollinearity. Although there are 12 factors in the first step, only nine factors are left after the multicollinearity check (three factors are eliminated)

As collinearity depends on the number of independent variables, factors with high multicollinearity are eliminated step by step so that the VIF of all independent variables could be < 10. Finally, statistical analysis and machine learning are performed using nine independent variables after removing dry unit weight, specific gravity, and saturated water content (the right side of Table 1).
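The VIF screening can be sketched in plain Python as follows: each variable is regressed on all the others by ordinary least squares, and VIF = 1 / (1 − R²) is computed from the resulting coefficient of determination. This is an illustrative implementation under stated assumptions (normal equations solved by Gaussian elimination, no handling of singular or constant columns); the function names and the toy columns are hypothetical and are not the study's data or software.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def r_squared(target, predictors):
    """R^2 of an OLS fit of target on the predictor columns plus an intercept."""
    n = len(target)
    cols = [[1.0] * n] + predictors
    k = len(cols)
    xtx = [[sum(cols[i][t] * cols[j][t] for t in range(n)) for j in range(k)]
           for i in range(k)]
    xty = [sum(cols[i][t] * target[t] for t in range(n)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(beta[i] * cols[i][t] for i in range(k)) for t in range(n)]
    mean = sum(target) / n
    ss_res = sum((target[t] - fitted[t]) ** 2 for t in range(n))
    ss_tot = sum((target[t] - mean) ** 2 for t in range(n))
    return 1.0 - ss_res / ss_tot


def variance_inflation_factors(columns):
    """VIF of each column, regressing it on all the other columns."""
    vifs = []
    for j, target in enumerate(columns):
        others = [col for i, col in enumerate(columns) if i != j]
        r2 = r_squared(target, others)
        vifs.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return vifs


# Toy columns: the second is almost twice the first, the third is unrelated.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
c = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]
vifs = variance_inflation_factors([a, b, c])
```

With these toy columns, the two nearly proportional variables receive VIF values far above 10, whereas the third stays below the threshold; removing one of the first two and recomputing mirrors the stepwise elimination described above.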

Results of analyses

Logistic regression (LR)

LR analysis uses the nine independent variables filtered through data screening. The AUC of the LR model of 0.776 indicates high prediction ability. In addition, the regression model is determined to be valid, because the Nagelkerke R-squared value, which describes the explanatory power of the regression model, is 0.410, and the Hosmer and Lemeshow test significance probability is 0.317. Table 2 shows the regression coefficients of the above regression model and the influence of each independent variable on landslide occurrence. Soil depth has the greatest influence (25.76%), followed by porosity and friction angle.

Table 2 Results of logistic regression analysis

Structural equation model (SEM)

Figure 5 and Table 3 show the results of SEM analysis. The entire model system includes the internal and external models. The internal model, comprising physical properties, mechanical properties, topographic properties, and occurrence, is depicted by arrows pointing at occurrence, as each latent variable affects this outcome. The external model is shown by arrows pointing to the independent variables from each latent variable. The label on each arrow gives its statistical weight, which is a measure of how effectively the latent variable can explain the independent or other latent variables. The weight is the same as the effectiveness in Table 3.

Fig. 5

Results of SEM analysis, in which physical properties, mechanical properties, and topographic characteristics affect landslide occurrence. Numbers near each arrow in the external and internal models indicate absolute values of the statistical weight, which quantifies that factor’s effect on occurrence

Table 3 Results of quantified influence of factors on landslides. The total influence is calculated by multiplying the influence in the external model by that in the internal model

According to path model theory, the effectiveness of each factor is the product of the weight in the external model and that in the internal model. For example, the effect of soil depth on occurrence is calculated as 0.547, the product of 0.945 in the external model and 0.579 in the internal model (Table 3). The most influential factor is soil depth; the next most influential factors in order are porosity, saturated unit weight, and slope angle. The cohesion, profile curvature, and TWI have little effect on occurrence.
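The product rule can be checked with a few lines of Python. The two weights are the soil-depth values quoted in the text; the function name is hypothetical.

```python
def total_effect(external_weight, internal_weight):
    """Total influence of a manifest variable: external weight times internal weight."""
    return external_weight * internal_weight


# Soil depth: external-model weight 0.945, internal-model weight 0.579.
soil_depth_effect = round(total_effect(0.945, 0.579), 3)
```

Rounded to three decimals, the product is 0.547.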

Reliability assessment of the entire model is based on the confidence level using p-values and the goodness of fit index (GFI). The statistical criterion evaluating the significance of the results at the 95% confidence level is considered here: p < 0.05 indicates satisfying the 95% confidence level. The GFI is calculated from the average communality and/or the geometric mean of the average determination coefficient. The criteria for high and low confidence in the GFI are the same as those used for R2, as the statistical meaning of GFI is similar to that of the determination coefficient. The criteria are as follows (Zikmund 2000; Moore et al. 2013; Sanchez 2013):

  • Low: R2 < 0.3,

  • Moderate: 0.3 < R2 < 0.6,

  • High: R2 > 0.6.

The resulting p-value and GFI of the entire model are 0.000 and 0.763, respectively, which means that our results can be considered statistically significant with a “high” confidence grade.


Extreme gradient boosting (XGBoost)

Hyperparameter optimization, the most critical analysis process in machine learning, is first performed by grid searching. This involves evaluating hyperparameter candidate values taken at regular intervals and selecting the combination that exhibits the highest performance. Hyperparameter selection uses five-fold cross-validation (Table 4).
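The procedure can be sketched generically in Python as a grid search scored by k-fold cross-validation (k = 5 here, on the assumption that this is the five-part verification meant above). The sketch uses only the standard library and a toy threshold rule in place of a boosting model, so every name, value, and the scoring rule itself is hypothetical; it shows the search loop, not the study's actual configuration.

```python
from itertools import product


def k_fold_indices(n_samples, k=5):
    """Split sample indices 0..n_samples-1 into k interleaved folds."""
    return [list(range(start, n_samples, k)) for start in range(k)]


def grid_search_cv(param_grid, score_fn, n_samples, k=5):
    """Return the parameter combination with the best mean held-out score."""
    names = sorted(param_grid)
    folds = k_fold_indices(n_samples, k)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        fold_scores = []
        for held_out in folds:
            held_out_set = set(held_out)
            train = [i for i in range(n_samples) if i not in held_out_set]
            fold_scores.append(score_fn(params, train, held_out))
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy data (illustration only): soil depth in metres and occurrence labels.
depths = [0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 3.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]


def threshold_accuracy(params, train_idx, test_idx):
    """Held-out accuracy of the rule 'occurrence if depth >= threshold'.

    The toy rule has nothing to fit, so train_idx is unused; a real model
    would be trained on train_idx before being scored on test_idx.
    """
    hits = sum(
        1 for i in test_idx
        if (depths[i] >= params["threshold"]) == (labels[i] == 1)
    )
    return hits / len(test_idx)


best_params, best_score = grid_search_cv(
    {"threshold": [1.0, 1.4, 2.0]}, threshold_accuracy, len(depths)
)
```

In a real workflow the scoring function would train the boosting model on the training folds and return a held-out metric such as negative log-loss; libraries such as scikit-learn provide this loop ready-made.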

Table 4 Results of hyperparameter optimization using XGBoost

Log-loss is used to evaluate the performance of the training and test models. The learning performance and confusion matrix results are shown in Figs. 6 and 7, respectively, and Table 5 lists the predictive performance of each model. Learning performance is similar across most of the training models, with the 9:1 ratio of training to test data performing best; among the test models, the 8:2 ratio performs best. Predictive performance is judged to be good: accuracy and AUC range from 60 to 90%, and they differ by < 20% between the training and test models. Precision and recall show a significant trade-off in the prediction models evaluated on the test data. For the 9:1 data ratio, precision and recall both reach 100%, depending on the label value; however, because the proportion of test data is so small, the probability of predicting an actual “0” value as “0” or an actual “1” value as “1” is considered unreliable. Training–test data ratios of 8:2 and 7:3 minimize the trade-off. As the 8:2 ratio performs slightly better than 7:3, it is considered optimal for a prediction model using XGBoost.
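
For readers unfamiliar with the metrics named above, the following sketch computes log-loss and label-wise precision and recall in plain Python. The labels and predicted probabilities are hypothetical, and the 0.5 decision threshold is an assumption; none of these values come from Table 5.

```python
from math import log


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * log(p) + (1 - y) * log(1 - p)
    return total / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall when `positive` is treated as the target label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, yhat in pairs if y == positive and yhat == positive)
    fp = sum(1 for y, yhat in pairs if y != positive and yhat == positive)
    fn = sum(1 for y, yhat in pairs if y == positive and yhat != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels (1 = landslide) and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
p_pred = [0.1, 0.4, 0.6, 0.2, 0.7, 0.3, 0.9, 0.8]
y_pred = [1 if p >= 0.5 else 0 for p in p_pred]

print(round(log_loss(y_true, p_pred), 3))
for label in (0, 1):
    print(label, precision_recall(y_true, y_pred, positive=label))
```

Evaluating both labels in turn corresponds to the phrase "depending on the label value": precision and recall for label 0 and for label 1 generally differ.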

Fig. 6 Results of learning performance for XGBoost prediction models using different ratios of training and test data

Fig. 7 Confusion matrix results for XGBoost prediction models employing each data ratio

Table 5 Results of XGBoost prediction for different ratios of training and test data

Table 6 lists the influence of each factor in the XGBoost 8:2 model. Soil depth has the most prominent influence, followed by friction angle, slope angle, plan curvature, and porosity. The saturated unit weight, cohesion, profile curvature, and TWI appear to have no significant influence in this model.

Table 6 Each factor’s influence in the XGBoost 8:2 model


LightGBM also uses grid searching for hyperparameter optimization, with five-fold cross-validation (Table 7).

Table 7 Results of hyperparameter optimization for LightGBM

For LightGBM, log-loss is likewise used to evaluate the performance of the training and test models. The learning performance and confusion matrix results are shown in Figs. 8 and 9, respectively, and Table 8 lists the predictive performance of each model. As with XGBoost, the training model with a 9:1 training-to-test data ratio performs best, and the test model with an 8:2 ratio performs best. However, the learning performance of the 9:1 test model decreases as learning progresses. The accuracy and AUC decrease as the proportion of training data decreases in both the training and test models. Most of the test models show a significant trade-off between recall and precision in the confusion matrix results. Among them, the models with 8:2 and 5:5 ratios perform best. Given the high proportion of test data in the 5:5 model, an 8:2 data ratio is considered optimal for the prediction model.
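
The sensitivity of the 9:1 split is easier to see from the absolute test-set sizes. The sketch below assumes that all 85 survey points reported in the paper (40 with landslides, 45 without) enter the machine learning models and that the test share is rounded up; both are assumptions, as the exact split sizes are not stated in this section.

```python
N_POINTS = 40 + 45  # landslide + non-landslide survey points reported in the paper


def split_sizes(n_samples, train_parts, test_parts):
    """Return (n_train, n_test) for a train:test ratio, rounding the test share up."""
    total_parts = train_parts + test_parts
    n_test = -(-n_samples * test_parts // total_parts)  # ceiling division
    return n_samples - n_test, n_test


for train_parts, test_parts in [(9, 1), (8, 2), (7, 3), (5, 5)]:
    n_train, n_test = split_sizes(N_POINTS, train_parts, test_parts)
    step = 100 / n_test  # accuracy change caused by one misclassified test point
    print(f"{train_parts}:{test_parts}  train={n_train}  test={n_test}  "
          f"one error = {step:.1f} percentage points")
```

Under these assumptions the 9:1 split leaves only nine test points, so a single misclassification shifts test accuracy by about 11 percentage points, compared with about 6 points for the 8:2 split.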

Fig. 8 Results of learning performance for LightGBM prediction models using different data ratios

Fig. 9 Confusion matrix results for LightGBM prediction models using different data ratios

Table 8 Results of LightGBM prediction for different training and test data ratios

Table 9 lists the influence of each factor in the LightGBM 8:2 model. Soil depth is the most influential, followed by friction angle, plan curvature, cohesion, porosity, slope angle, profile curvature, TWI, and saturated unit weight. However, the model depends markedly on the first two factors, which together account for 82% of the total influence; the other factors have no significant influence (each < 7%).
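
Percentage influences of this kind are typically obtained by normalizing the raw importance scores of a tree model so that they sum to 100%. The sketch below shows that normalization and the cumulative share of the top-ranked factors. The raw scores are hypothetical round numbers chosen for illustration and are NOT the values in Table 9; the function names are likewise assumptions.

```python
def importance_shares(raw_importance):
    """Convert raw importances to percentage shares, sorted from high to low."""
    total = sum(raw_importance.values())
    shares = {name: 100 * value / total for name, value in raw_importance.items()}
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


def cumulative_share(shares, top_k):
    """Sum of the percentage shares of the top_k highest-ranked factors."""
    return sum(list(shares.values())[:top_k])


# Hypothetical raw importance scores (not the paper's results).
raw = {
    "plan curvature": 8.0,
    "soil depth": 50.0,
    "porosity": 5.0,
    "friction angle": 30.0,
    "cohesion": 7.0,
}

shares = importance_shares(raw)
for name, share in shares.items():
    print(f"{name:15s} {share:5.1f}%")
print("top two:", cumulative_share(shares, 2))
```

A large cumulative share for the first few factors indicates that the model relies on a small subset of its inputs, which is the situation described for the LightGBM 8:2 model.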

Table 9 Influence of each factor in the LightGBM 8:2 model

Discussion: results and comparison of methods

Table 10 summarizes the influence of the selected factors for each analysis method. Soil depth is consistently the most influential (> 25%). The rankings of friction angle, slope angle, and porosity differ slightly among the analysis methods. Friction angle shows uniform influence (> 10%) across all the methods. Porosity is substantially influential in the LR and SEM analyses but has minimal influence in the machine learning models. Among the machine learning methods, slope angle is substantially influential only in XGBoost. Saturated unit weight, profile curvature, plan curvature, TWI, and cohesion generally have small (< 10%) influences.

Table 10 Summary of influence and rank for analysis methods

Soil depth is the most influential factor because it relates directly to conditions that may cause landslides. The data in section "The theory of structural equation model" clearly show its correlation with landslide occurrence: soil depth is generally ≤ 2 m in areas with no landslide and ≥ 1 m in most areas where landslides have occurred. Friction and slope angles are highly influential because they directly affect the driving and resisting forces in the soil (Mehrotra et al. 1992; Budimir et al. 2015; Çellek 2020). Porosity is rarely considered significant when investigating the factors influencing landslides on natural slopes. However, where artificial compaction is performed, as on forest road slopes, porosity is highly influential because it represents the degree of compaction. Unit weight ranks fifth here for artificial slopes because it is inversely related to porosity. This study finds the topographic factors (profile curvature, plan curvature, and TWI) to have insignificant influence because landslides around forest roads are affected more by the degree of compaction or the resisting force than by the concave or convex shape of the slope. Cohesion acts only on the resisting force in stability analysis and significantly affects slope activities (Cousins 1978; Ahmadi-Adli et al. 2014; Lin et al. 2016). However, this study attributes insignificant influence to cohesion for two reasons: the sandy (SP to SW) soil in the study area has low cohesion, and the calculated cohesion values are widely scattered because the Mohr–Coulomb failure criterion was applied over a small range (the median cohesion is higher for occurrence sites, whereas the IQR and whiskers are larger for non-occurrence sites; Fig. 4h).
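
The roles of these factors in the driving and resisting forces can be illustrated with the textbook infinite-slope model for a dry slope with Mohr–Coulomb strength. This is not the analysis performed in the study; it is a minimal sketch, and all input values below are hypothetical. Pore-water pressure is ignored for simplicity.

```python
from math import radians, sin, cos, tan


def factor_of_safety(cohesion_kpa, friction_deg, slope_deg, depth_m,
                     unit_weight_kn_m3):
    """Dry infinite-slope factor of safety with Mohr-Coulomb strength."""
    beta = radians(slope_deg)
    phi = radians(friction_deg)
    # Stresses on a slip plane parallel to the surface at the given depth.
    normal_stress = unit_weight_kn_m3 * depth_m * cos(beta) ** 2
    driving_stress = unit_weight_kn_m3 * depth_m * sin(beta) * cos(beta)
    resisting_stress = cohesion_kpa + normal_stress * tan(phi)
    return resisting_stress / driving_stress


# Hypothetical sandy soil with low cohesion on a 35 degree slope.
for depth in (0.5, 1.0, 2.0):
    fs = factor_of_safety(cohesion_kpa=2.0, friction_deg=32.0, slope_deg=35.0,
                          depth_m=depth, unit_weight_kn_m3=18.0)
    print(f"depth {depth:.1f} m -> factor of safety {fs:.2f}")
```

In this model the cohesion term is divided by the driving stress, which grows with depth, so the stabilizing contribution of a small cohesion shrinks as the soil layer thickens, while the frictional term depends only on the ratio of tan(friction angle) to tan(slope angle).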


Data for a mountainous area with forest roads were acquired through geological surveying, sampling, and laboratory testing, and the influence on landslide susceptibility of each measured parameter was analyzed using statistical and machine learning methods. The results are summarized as follows.

  1. The target area was Sancheok-myeon, Chungju-si, where rainfall of 259 mm on August 2, 2020 caused several landslides along the forest road in a narrow area of approximately 3 km × 4 km. As the area has the same rainfall and vegetation conditions, the influences of the physico-mechanical characteristics of the soil and topographic characteristics could be analyzed precisely.

  2. Geological surveying and sampling were conducted at 40 survey points where landslides occurred and 45 points where they did not. The soil’s physico-mechanical characteristics and topographic factors for each survey point were acquired. Only nine factors were subjected to statistical analysis and machine learning methods.

  3. LR and SEM analysis results showed high accuracy, with values of 0.776 and 0.763, respectively. XGBoost and LightGBM exhibited outstanding performance in predicting landslides, with accuracy and AUC of 60%–90%, and differences of < 20% between the training and test data.

  4. All analysis methods identified soil depth as having the greatest influence on landslide occurrence. Friction angle, slope angle, and porosity were also selected as influential factors, although they differed slightly in the rankings of the different analysis methods.

As the analysis results of this study are for an area across which rainfall and vegetation conditions are largely consistent, the influences of the soil’s physico-mechanical characteristics and the topography were analyzed more precisely than in studies comparing landslides across multiple regions. The results of this study are expected to be useful in the preparation of landslide vulnerability maps around forest roads.

Availability of data and materials

Data and materials are available upon request.


  • Ahmadi-Adli M, Huvaj N, Toker NK (2014) Effects of the size of particles on rainfall-induced slope instability in granular soils. In: Proceedings of the Geo-Congress 2014, Atlanta, GA, USA, 23–26 February 2014

  • An KM (2021) Developing a prediction model for firm innovation and performance using statistical matching and machine learning ensemble techniques. Dongguk University, Doctoral dissertation 289

  • ASTM D2216-10 (2010) Standard test methods for laboratory determination of water (moisture) content of soil and rock by mass. ASTM International, West Conshohocken, PA.

  • ASTM D2487-17 (2017) Practice for classification of soils for engineering purposes (Unified Soil Classification System). ASTM International, West Conshohocken, PA.

  • ASTM D3080-98 (1998) Standard test method for direct shear test of soils under consolidated drained conditions. ASTM International, West Conshohocken, PA.

  • ASTM D422-63 (2007) Standard test method for particle-size analysis of soils. ASTM International, West Conshohocken, PA.

  • ASTM D854-10 (2010) Standard test methods for specific gravity of soil solids by water pycnometer. ASTM International, West Conshohocken, PA.

  • Beven KJ, Kirkby MJ (1979) A physically based, variable contributing area model of basin hydrology. Hydrol Sci Bull 24:43–69

  • Broothaerts N, Kissi E, Poesen J, Van Rompaey A, Getahun K, Van Ranst E, Diels J (2012) Spatial patterns, causes and consequences of landslides in the Gilgel gibe catchment, SW Ethiopia. CATENA 97:127–136

  • Budimir MEA, Atkinson PM, Lewis HG (2015) A systematic review of landslide probability mapping using logistic regression. Landslides 12:419–436

  • Çellek S (2020) Effect of the slope angle and its classification on landslide. Nat Hazard 87:23

  • Chae BG, Kim WY, Cho YC, Kim KS, Lee CO, Choi YS (2004) Development of a logistic regression model for probabilistic prediction of debris flow. J Eng Geol 14(2):211–222 (in Korean with English abstract)

  • Chen F, Yu B, Li B (2018) A practical trial of landslide detection from single-temporal Landsat8 images using contour-based proposals and random forest: a case study of national Nepal. Landslides 15:453–464

  • Chen SC, Chang CC, Chan HC, Huang LM, Lin LL (2013) Modeling typhoon event-induced landslides using GIS-based logistic regression: A case study of Alisan Forestry Railway, Taiwan. Math Problems Eng Article ID 728304

  • Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, USA: ACM, pp 785–794

  • Choi YH, Lee JW, Kim MJ (2011) A Study on development standard calculation program of forest road drainage facilities. J Kor Soc for Sci 100(1):25–33 (in Korean with English abstract)

  • Cousins BF (1978) Stability charts for simple earth slopes. J Geotech Eng Div 104:267–279

  • CRED (2023) 2022 Disasters in numbers. Brussels: CRED, Retrieved from

  • Dikau R (1989) The application of a digital relief model to landform analysis in geomorphology. In: Three dimensional applications in geographical information systems. CRC Press, pp 51–77

  • DSBA (2020) November 12, 04-7: Ensemble Learning – XGBoost, Youtube, Retrieved from

  • Dutton AL, Loague K, Wemple BC (2005) Simulated effect of a forest road on near-surface hydrologic response and slope stability. Earth Surf Proc Land 30:325–338

  • Eker R, Aydin A (2014) Assessment of forest road conditions in terms of landslide susceptibility: a case study in Yığılca Forest Directorate (Turkey). Turk J Agric for 38(2):281–290

  • Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett 27(8):861–874

  • Fernandes NF, Guimarães RF, Gomes RAT, Vieira BC, Montgomery DR, Greenberg H (2004) Topographic controls of landslides in Rio de Janeiro: field evidence and modeling. CATENA 55:163–181

  • Gerrard J, Gardner R (2002) Relationships between landsliding and land use in the Likhu Khola drainage basin, Middle Hills, Nepal. Mount Res Dev 22:48–55

  • Godt JW, Baum RL, Savage WZ, Salciarini D, Schulz WH, Harp EL (2008) Transient deterministic shallow landslide modeling: requirements for susceptibility and hazard assessments in a GIS framework. Eng Geol 102(3–4):214–226

  • Hasegawa S, Dahal RK, Yamanaka M, Bhandary NP, Yatabe R, Inagaki H (2009) Causes of large-scale landslides in the Lesser Himalaya of central Nepal. Environ Geol 57:1423–1434

  • Hox JJ, Bechger TM (1999) An introduction to structural equation modeling. Family Sci Rev 11:354–373

  • John JC, Douglas S (2012) Landslides types, mechanisms and modeling. Cambridge University Press, p 420

  • Jotisankasa A, Vathananukij H (2008) Investigation of soil moisture characteristics of landslide-prone slopes in Thailand. In: International Conference on Management of Landslide Hazard in the Asia-Pacific Region 11th -15th November 2008 Sendai Japan, p 12

  • Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu TY (2017) LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst 30:1

  • Kimaro DN, Msanya BM, Kilasara M, Mtakwa PW, Poesen J, Deckers JA (2000) Major factors influencing the occurrence of landslides in the northern slopes of the Uluguru Mountains, Tanzania. Workshop Presentation, pp 67–78

  • Kim HS (2022) Analysis on major influential factors and occurrence probability of landslide in forest road. PhD thesis, Chungbuk National University, p 114

  • KMA (Korea Meteorological Administration) (2023), Open MET data portal.

  • Korea Forest Service (2021) Comprehensive measures for national landslide prevention, pp 1–56

  • Kutner MH, Nachtsheim CJ, Neter J (2004) Applied linear regression models, 4th ed. McGraw-Hill Education, p 701

  • Lin HD, Jiang YS, Wang CC, Chen HY (2016) Assessment of apparent cohesion of unsaturated lateritic soil using an unconfined compression test. In: Proceedings of the 2016 world congress on advances in civil, environmental, and materials research (ACEM16), Jeju, Korea, 28 August–1 September 2016

  • Luce CH, Wemple BC (2000) Special issue: hydrologic and geomorphic effects of forest roads. Earth Surf Proc Land 26:111–232

  • Mehrotra R, Namuduri K, Ranganathan N (1992) Gabor filter-based edge detection. Pattern Recogn 25(12):1479–1494

  • Moon SW, Yun HS, Seo YS (2020) Physical properties and friction characteristics of fault cores in South Korea. Econ Environ Geol 53(1):71–85

  • Moore DS, Notz WI, Fligner MA (2013) The basic practice of statistics, 6th ed. WH Freeman and Company, New York, NY, p 138

  • Nugraha H, Wacano D, Dipayana GA, Cahyadi A, Mutaqinc BW, Larasati A (2015) Geomorphometric characteristics of landslides in the Tinalah watershed, Menoreh Mountains, Yogyakarta, Indonesia. Procedia Environ Sci 28:578–586

  • Pham BT, Pradhan B, Bui DT, Prakash I, Dholakia MB (2016) A comparative study of different machine learning methods for landslide susceptibility assessment: a case study of Uttarakhand area (India). Environ Model Softw 84:240–250

  • Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A (2017) CatBoost: unbiased boosting with categorical features. In: 32nd Conferences on neural information processing systems, Montreal, Canada, p 31

  • Osman N, Barakbah SS (2006) Parameters to predict slope stability soil water and root profiles. Ecol Eng 28:90–95

  • Owen RC (1981) Soil strength and microclimate in the distribution of shallow landslides. J Hydrol 20:17–26

  • Qu M, Bai Y, Hu Q, He L, Qiu E, Wan X (2021) A comprehensive prediction method for the saturated internal friction angle of sliding zone soils based on landslide engineering requirements. Geotech Eng 25:4144–4158

  • Quan HC, Lee BG, Lee CS, Ko JW (2011) The landslide probability analysis using logistic regression analysis and artificial neural network methods in Jeju. J Kor Soc Geospat Inf Sci 19(3):33–40 (in Korean with English abstract)

  • Regmi NR, Giardino JR, Vitek JD (2010) Modeling susceptibility to landslides using the weights of evidence approach: Western Colorado, USA. Geomorphology 115:172–187

  • Ryu SG (2008) Effects of multicollinearity in logit model. J Kor Soc Transp 26(1):113–126 (in Korean with English abstract)

  • Sanchez G (2013) PLS path modeling with R. Trowchez Editions, Berkeley: 222

  • Schmidt K, Roering J, Stock J, Dietrich W, Montgomery D, Schaub T (2001) The variability of root cohesion as an influence on shallow landslide susceptibility in the Oregon Coast Range. Can Geotech J 38:995–1024

  • Šimundić AM (2009) Measures of diagnostic accuracy: basic definitions. J Int Feder Clin Chem Lab Med 19(4):203–211

  • Ullman JB, Bentler PM (2013) Structural equation modeling. In: JA Schinka and WF Velicer (eds), Handbook of Psychology (vol 2), Research Methods in Psychology, Hoboken, NJ, Wiley, pp 661–690

  • Vanacker V, Molina A, Govers G, Poesen J, Dercon G, Deckers S (2005) River channel response to short-term human-induced change in landscape connectivity in Andean ecosystems. Geomorphology 72:340–353

  • Wang LH, Guo M, Sawada K, Lin J, Zhang J (2016) A comparative study of landslide susceptibility maps using logistic regression, frequency ratio, decision tree, weights of evidence and artificial neural network. Geosci J 20(1):117–136

  • Wemple BC, Jones JA, Grant GE (1996) Channel network extension by logging roads in two basins, Western Cascades, Oregon. J Am Water Resour Assoc 32(6):1195–1207

  • Wobus CW, Hodges KV, Whipple KX (2003) Has focused denudation sustained active thrusting at the Himalayan topographic front? Geology 31:861

  • Wright S (1921) Correlation and causation. J Agric Res 20:557–585

  • Xiao T, Segoni S, Chen L, Yin K, Casagli N (2020) A step beyond landslide susceptibility maps: a simple method to investigate and explain the different outcomes obtained by different approaches. Landslides 17:627–640

  • Yoon YG (2020) Feature extraction and analysis of electrocardiogram using LightGBM. Master’s thesis of Korea University, p 28

  • Yung Y (2008) Structural equation modeling and path analysis using PROC TCALIS in SAS 9.2. In: Proceedings of the SAS Global Forum 2008 Conference, Cary, NC, SAS Institute Inc. Paper 384

  • Zikmund WG (2000) Business research methods (6th ed). Fort Worth: Harcourt College Publishers: 513


The authors are grateful to the editorial board and anonymous reviewers for their constructive comments, which improved the manuscript.


This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2020R1A6A3A03038855).

Author information

Authors and Affiliations



MSW, SYS: Conceptualization, methodology; MSW, KHS: Investigation, data collection; MSW, KHS, NJD, KSS: Data analysis, software, validation; MSW, NJD: Writing-original draft preparation; SYS, KSS: Reviewing and editing of draft, supervision. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Yong-Seok Seo.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Moon, SW., Noh, J., Kim, HS. et al. Comparison of factors influencing landslide risk near a forest road in Chungju-si, South Korea. Geoenviron Disasters 11, 3 (2024).
