Comparison of factors influencing landslide risk near a forest road in Chungju-si, South Korea

Background The study aimed to identify the influential factors required to prepare landslide vulnerability maps and establish disaster prevention measures for mountainous areas with forest roads. The target area is Sancheok‑ myeon, Chungju‑si, where several landslides have occurred in a narrow area of approximately 3 km × 4 km. As the area has the same rainfall and vegetation conditions, the influences of the physico‑mechanical characteristics of the soil in accordance with compaction and topographic characteristics could be analyzed precisely. Methods Geological surveying, sampling, and laboratory testing assessed landslide risk in the study area, and data including unit weight, specific gravity, porosity, water content, soil depth, friction angle, cohesion, slope angle, profile/ plan curvature, TWI were obtained. Preprocessing and screening such as min‑max normalization and multicollinearity were conducted for the data in order to eliminate overestimation of each factor’s effectiveness. The influence of each factor was analyzed using logistic regression (LR), structural equation modeling (SEM), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM). Results All methods showed that soil depth has the greatest impact on landslide occurrence. Friction angle, slope angle, and porosity were also selected as influential factors, although each method ranked them slightly differ‑ ently. Topographic factors, such as plan curvature, profile curvature, and the topographic wetness index, had mini‑ mal influence. This appears to be because landslides near forest roads are more affected by how well compaction was performed during banking than by the concave or convex shape of the slope. This study presents analysis results for an area with the same rainfall and vegetation conditions; therefore, the analysis of the influence of the physico‑ mechanical characteristics of the soil and topography was more precise than when comparing landslides occurring in different regions. Our results may be helpful in preparing landslide vulnerability maps.


Introduction
Landslides constitute one of the most dangerous types of natural disaster.They are known to have caused 838 annual deaths globally between 2002 and 2021 (CRED 2023).The precise mapping of landslide susceptibility and methods to assess landslide risk to decrease their potential damage have received substantial research attention.However, predicting landslide occurrence remains difficult despite sustained research efforts, because it is affected by complex interactions among many factors, including geological conditions, geomorphology, climate, earthquakes, and vegetation (Gerrard and Gardner 2002;Wobus et al. 2003;Hasegawa et al. 2009).The main factors influencing landslide occurrence and the relationships among them remain unclear without rainfall factor, thus hindering precise landslide prediction (John and Douglas 2012).
With reference to analysis methods related to landslide, statistical methods including conditional probability, weight of evidence, frequency ratio (FR), and logistic regression (LR) were typically used in the 1990s and 2000s to analyze the influences of factors causing landslides and to predict landslides.Machine learning methods, such as artificial neural networks and deep learning, have been used since the 2010s.For example, EKER and Aydin (2014) prepared a landslide vulnerability map in an analysis of landslide vulnerability for different road types (e.g., forest roads and expressways) by conducting geographic-information-system-based LR analysis of land use, petrology, elevation, slope, side, distance to rivers, distance to roads, and plan curvature.Pham et al. (2016) assessed landslide vulnerability in 930 landslide areas by analyzing Google images using support vector machine, LR, Fisher's linear discriminant analysis, Bayesian network, and naïve Bayes techniques.Wang et al. (2016) proposed a landslide prediction model using LR, FR, decision tree, weight of evidence, and artificial neural networks.Chen et al. (2018) proposed a landslide vulnerability model using a random forest (RF) algorithm based on a digital elevation model and Landsat-8 data.Xiao et al. (2020) proposed a landslide vulnerability model using hybrid models combining RF, FR, CF (certainty factor), and the index of entropy (IOE), namely RF-FR, RF-CF, RF-IOE, IOE-CF, and CF-FR.Further methods-such as big data, machine learning, and deep learning methods, which may overcome existing mathematical and engineering limitations-have been actively used in recent years.Representative prediction models include boosting-based models, such as extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and category boost (CatBoost) (Chen and Guestrin 2016;Ke et al. 2017;Prokhorenkova et al. 2017).
In relation to influential factors of landslide, precipitation or rainfall intensity was pointed out as the most influential factors on landslide in research cases using data-driven analysis because rainfall conditions are different with each other due to greater distance between data collection points as numerous landslide cases need to be analyzed (Chae et al. 2004;Quan et al. 2011;Chen et al. 2013).Also, in physically based analysis, the friction angle and cohesion included in the slope stability analysis equation were evaluated as the most dominant factors on landslide (Regmi et al. 2010;Qu et al. 2021).Besides, it was reported that the stability of slope with complete vegetation cover is higher than that of slope with meagre vegetation (Schmidt et al. 2001;Osman and Barakbah 2006), and there were substantial interests in the effects of forest roads hydrologically and geomorphically on earth surface and landforms (Luce and Wemple 2000;Dutton et al. 2005).Vanacker et al. (2005) reported that changes in the forest landscape or large-scale logging, which change the soil infiltration and ground evapotranspiration rates, thus indirectly affect the water contents in soil and reduce slope stability.It was reported that the soils from landslide prone areas were mainly silty soils with low plasticity (Jotisankasa and Vathananukij 2008).Nugraha et al. (2015) argued that land surface (geomorphometric) characteristics have a significant relationship with the landslide distribution, and even others have emphasized the role of investigating topographic influence (Fernandes et al. 2004;Broothaerts et al. 2012).In addition, Owen (1981) said that the sunny aspects were much more susceptible to landslide than the shady aspects.When the soils are saturated, the liquid limit water content of the sunny aspect subsoil is exceeded, while that of the shady aspect subsoil is not.Meanwhile, Kimaro et al. (2000) suggested that the most important soil characteristics is presence of saprolite or boundary with hard bed rock.As mentioned above, the most influential factors are differently evaluated depending on researcher's perspectives because various factors including vegetation, climate, geology, topography and so on, affect landslides.Throughout Korea's many mountainous areas, several forest roads have been constructed for forest management.Construction of these roads involves the formation of cutting and banking slopes, which affect slope stability by changing the ground, topography, and water flow (Wempleet al. 1996;Choi et al. 2011).In addition, a recent increase in guerrilla rainstorms caused by climate change has increased landslide risk.According to the Korea Forest Service, the frequency of landslides is increasing annually, and the size of a landslide depends on the region and season, with typhoons and heavy rainfall being concentrated in summer (Korea Forest Service 2021).Therefore, previous regional landslide analyses have focused primarily on rainfall, which is an external factor; therefore, detailed risk plan considering internal factors has not been facilitated when activities such as the selection of areas for reinforcement and the establishment of disaster prevention measures are conducted.The present study analyzes the effects of soil, topographic factors, and rainfall on landslides using statistical and machine learning methods to identify major influencing factors.The results may aid in the preparation of landslide vulnerability maps and establish disaster prevention measures (e.g., prioritizing areas for reinforcement) within budgetary constraints.

Methodology
The theory of logistic regression analysis Logistic regression (LR) analysis determines correlations between a dependent variable and multiple independent variables influencing it.The probability of an event can be calculated and expressed as a value between 0 and 1.Values can be binarized by those ≥ 0.5 being assigned as 1, and those < 0.5 being assigned as 0. The probability of an event through LR analysis ( P Z ) is given by Eqs. ( 1) and ( 2): where Z is the LR, α is a constant, and β n is the regression coefficient of the independent variable ( X n ).
The Nagelkerke R-squared, Hosmer and Lemeshow, and confusion matrix verification methods were used to analyze the reliability of the results of LR analysis.Nagelkerke R-squared indicates the degree to which independent variables can explain the dependent variable.A value of ≥ 20% indicates that the independent variables have explanatory power.The Hosmer and Lemeshow test determines the overall goodness-of-fit of a regression model.A significance level of > 0.05 indicates that the model has explanatory power.The confusion matrix estimates prediction accuracy as the area under the curve (AUC) calculated for an ROC curve with an X-axis of "1-specificity" and a Y-axis of "sensitivity".Values of AUC are distributed between 0 and 1, with a value close to 1 indicating an accurate model (Fawcett 2006;Godt et al. 2008;Šimundić 2009).Accuracy, specificity, and sensitivity can be calculated using Eqs.(3), (4), and (5), respectively: (1) where a true positive (TP) is the correct prediction of a positive value, a true negative (TN) is a correct prediction of a negative value, a false positive (FP) is a negative value incorrectly predicted as positive, and a false negative (FN) is a positive value incorrectly predicted as negative.

The theory of structural equation model
The SEM used here was first suggested by Wright (1921).It is a path-analysis-based statistical method used to identify causal relationships among multiple variables with complex interrelationships and in cases with many independent and dependent variables.Although it seems similar to multiple regression analysis, it is a more detailed model because it can consider the mutual influences of all variables and can easily identify interrelationships among variables using graphical representation (Hox and Bechger 1999;Yung 2008;Ullman and Bentler 2013).The SEM can be subdivided into a measurement model for confirmatory factor analysis and a theoretical model for multiple regression and path analyses.The former is applied when each variable can explain latent variables perfectly, and the latter is used to group variables into representative latent variables and to find the most descriptive model.When studying landslides, the latter model is more suitable than the former, as many factors related to slope failure or landslide occurrence are difficult to explain fully, and there are limitations related to ground heterogeneity.
Theoretical modeling in an SEM is calculated using partial least squares (PLS), which are subdivided into partial least squares regression (PLS-R) and partial least squares path models (PLS-PMs).PLS-R is used when there are more variables than the number of data, and PLS-PMs are applied to analyze interrelationships.This study applies a PLS-PM, which can deal with a large amount of data and analyze interrelationships and causal analysis among influential factors related to landslide occurrence.PLS-PM analysis is a multivariate analysis technique that analyzes the systems of relationships among many blocks containing variables.This approach follows the component-based estimation procedure, and it is defined by the following two basic concepts: each block of variables acts as a latent variable, and it is assumed that there is a system of linear relationships between blocks.The analysis (4) Specificity = TN (TN + FP) (5) Sensiticity = TP (TP + FN ) considers multiple relationships between variable blocks, and each variable block is assumed to be represented by a latent variable or theoretical concept.Here, each latent variable is a hypothetical variable created to generate an SEM.The latent variables are grouped with variables that have similar characteristics.The determination and interrelationships of the latent variables are set by the researcher's subjective judgment, and continuous modification and supplementation are required until an optimal model is developed.
Figure 1 outlines the PLS-PM, showing the constituent manifest (dependent and independent variables, X ij ) and latent variables.The first step in PLS-PM analysis is to set the latent variables-which comprise manifest variables with similar characteristics (Fig. 1a)-and then the internal model is determined based on the latent variables of the external model (Fig. 1b).The setup of the external and internal models is then modified and updated until statistically significant results are obtained for the arrangement of variables, causal relationships, and error terms.The last step assesses the confidence of the entire model using the external and internal models estimated from the above two steps (Fig. 1c).Here, weight and loading (α and β, respectively; Fig. 1) are essentially correlation coefficients.

The theory of XGBoost
XGBoost uses the classification and regression tree (CART) model for the existing gradient boosting algorithm and enables parallel processing, thereby enabling the resolution of various problems using data mining (Chen and Guestrin 2016;DSBA 2020;Yoon 2020;An 2021).Unlike other tree-based learning methods, the learning of XGBoost uses Eq. ( 6) based on the CART model.When the data comprise input variable x and output variable y, y is the predicted value of data x, K is the number of CARTs, and f is the CART model (Chen and Guestrin 2016).Equation ( 7) gives the objective function for training the CART model.Here, l y i , y i is the differ- ence between the actual and predicted values, and is the regularization of the model to prevent overfitting.The objective function equation at step t can be expressed using XGBoost's additive method and Taylor expansion, as shown in Eq. ( 8).According to the definition of the Taylor expansion, g i is the first-order derivative of y i and can be defined as g i = δ y (t−1) l y i , y (t−1) ; h i is the sec- ond-order partial derivative of y i (t−1) and can be defined 1) .The greedy and approximate algorithms are used to optimize the prediction model and identify the optimal split point using above equations (Chen and Guestrin 2016).

The theory of LightGBM
LightGBM is a GBM-and tree-based algorithm that performs learning using residuals.However, unlike the symmetric division method of conventional trees, its tree structure is asymmetrical due to its use of a leafwise methodology.LightGBM uses a feature histogram that divides continuous variables into discrete sections (bins) during learning.This method learns a function from slope space g to input space X S using a decision tree.In the presence of the training set of n independent and identically distributed entities {x 1 , x 2 , • • • x n } , X S is a vector with a dimension of s.For GBM, the loss function for the model output value generated at each iteration is defined as a negative slope g 1 , g 2 , • • • g n .This model uses Eq. ( 9) to divide each node through a variable with the largest information acquisition (Ke et al. 2017) where O is the training data in the tree node, d is the node, and j is the variable performing division at point d.However, this method is inefficient because it searches all divided sections.To prevent this, gradient-based oneside sampling (a method of reducing the number of data) and exclusive feature bundling (a method of reducing the number of variables) are used. (6)

Status of landslide occurrence and sampling locations
To analyze the factors influencing landslide occurrence, site investigation and sampling were conducted in Sancheok- According to precipitation record of the KMA from 2000 to 2023 (KMA 2023), the average annual precipitation of study area is 1,196 mm, and it is similar to that of South Korea (about 1200 mm).So, this region is an area where disaster caused by rainfall are rare.However, annual average precipitation of 1500 mm, a daily precipitation of 316 mm, and a maximum hourly precipitation of 76.5 mm were all record breaking in 2020 when a lot of landslides occurred.
Figure 2 shows sampling locations and photographs of the landslides that occurred in the study area.Sampling locations are colored either red or yellow: the 40 red points are locations at which landslides occurred, and the 45 yellow points indicate sampling locations where no landslides occurred.Locations with or without landslides were sampled in the one point (in case of occurrence, sampling was conducted in the head part) to provide the statistical and machine learning analyses with the necessary data for both landslide and non-landslide locations.Most landslides in the study area occurred near the forest road (Fig. 2) because the slope angle steepens at the cut slope, and the soil thickness increases where the construction of the road involved slope filling.The upper slopes of forest road comprised weathered soil of biotite granite, and the lower slopes of forest road were made up of embanked soil which was excavated when constructing forest road.Therefore, soil type of landslides was all same with weathered soil of granite, and that it was actually composed mostly of sand with SW-SP (well grade sand-poor grade sand).This led to the mineralogical compositions of soil particles being consistent across the sampling locations, as the roads had been constructed at the same time.The study area allows specific analysis of the influence of topography and physico-mechanical characteristics associated with soil compaction on landslide occurrence owing to vegetation and rainfall conditions being consistent throughout the area.
The dataset comprises the following information for each site: presence or absence of landslide occurrence (hereafter abbreviated as occurrence or non-occurrence); thickness of the soil layer (hereafter abbreviated as soil depth); slope angle; plan curvature and profile curvature; TWI; dry and saturated unit weights of the soil; and porosity, specific gravity, saturated water content, friction angle, and cohesion of the soil.Sampling was conducted in the head parts of areas where landslides occurred.Elevation was not considered, as the sampling points were at similar altitudes.Data for landslide occurrence and soil depth were obtained from site investigations and dynamic cone penetration testing, and physico-mechanical properties (unit weight, specific gravity, porosity, friction angle, and cohesion) were measured according to the test criteria of the American Society for Testing Materials (ASTM D2216-10; ASTM D2487-17; ASTM D3080-98; ASTM D422-63; ASTM D854-10).Topographic characteristics (slope angle, profile, and plan curvatures) were gained from 1:50,000 digital topographic maps of the National Geographic Information Institute and SAGA GIS software (IBM).The profile and plan curvatures describe whether the slope is concave (negative value) or convex (positive value) longitudinally and in crosssection, respectively (Fig. 3).The TWI is an indicator of the wet content of the soil and is calculated using Eq. ( 10) (Beven and Kirkby 1979): where SCA denotes the local upslope area draining through a certain point per unit contour length, and θ is the local slope in radians.The SCA is calculated using multiple flow directions, as the flow may vary according to the slope's direction and gradient.Most of the factors are continuous data; the only categorical factor is occurrence (1 for occurrence and 0 for non-occurrence).
(10) TWI = ln SCA tanθ The unit weight shows greater whisker and interquartile ranges for non-occurrence cases than for occurrence cases, regardless of the conditions being dry or saturated (Fig. 4a, b).The ranges of specific gravity are similar for occurrence and non-occurrence cases (Fig. 4c).The interquartile range and median of porosity are higher for occurrence cases than non-occurrence cases; porosity is expected to be proportional to occurrence, as soil can hold much water, increasing its weight and reducing the resistance force (Fig. 4d).The median saturated water content is slightly higher for occurrence cases than for non-occurrence cases (Fig. 4e) and is interpreted similarly to the results of porosity.The interquartile range and median of soil depth are higher for occurrence cases than non-occurrence cases (Fig. 4f ).Those of friction angle tend to be lower for occurrence cases than non-occurrence cases, whereas cohesion has the opposite tendency, being positively correlated with occurrence (Fig. 4g, h).In terms of mechanics, cohesion is generally proportional to non-occurrence.However, if the friction angle and cohesion were measured from direct shear tests, they would be inversely proportional to each other according to the Mohr-Coulomb failure criterion (Moon et al. 2020).For this reason, the friction angle is inversely proportional to occurrence, and cohesion is proportional to occurrence.The interquartile range of the slope angle is higher for occurrence than non-occurrence (Fig. 4i).The median profile and plan curvatures are inversely proportional to occurrence, meaning that a number of landslides occurred near valleys with concave topography (Fig. 4j,  k).The interquartile range and median of TWI are higher for occurrence than non-occurrence, indicating that soil containing water was prone to landslides (Fig. 4l).

Data preprocessing and screening
Data preprocessing and screening for statistical analysis were performed using min-max normalization and multicollinearity diagnosis.Min-max normalization was performed for the 12 measured independent variables (dry unit weight (kN/m 3 ), saturated unit weight (kN/m 3 ), specific gravity, porosity (%), saturated water content (%), friction angle (°), cohesion (kPa), soil depth (m), slope angle (°), profile curvature, plan curvature, and TWI) (Eq.( 11)): where X n , X , X min , and X max are the normalized, observed, minimum observed, and maximum observed values, respectively.The normalized results are all between 0 and 1, which facilitates direct comparison of the effects of each dependent variable (despite their initially different distributions and units) on the dependent variable (i.e., landslide occurrence).
Multicollinearity is a phenomenon in which negative effects (such as the overestimation of regression model variables and degraded reliability of regression results) may occur when highly correlated independent variables are used in regression analysis (Ryu 2008).Therefore, collinearity must be assessed before regression analysis.The variation inflation factor (VIF; Eq. ( 12)) can be used for this.A VIF of ≥ 10 indicates multicollinearity (Kutner et al. 2004).
where R 2 is the coefficient of determination.The left side of Table 1 lists the estimated multicollinearity among the 12 independent variables.Those with VIF values of ≥ 10, and thus high correlations corresponding to multicollinearity, are the dry unit weight ( γ d ), saturated unit weight ( γ sat ), specific gravity ( G s ), porosity ( e ), and saturated water content ( w ).These factors can be related using Eqs.( 13) to ( 15 where S is the degree of saturation. As collinearity depends on the number of independent variables, factors with high multicollinearity are eliminated step by step so that the VIF of all independent variables could be < 10.Finally, statistical analysis and machine learning are performed using nine independent variables after removing dry unit weight, specific gravity, and saturated water content (the right side of Table 1).

Logistic regression (LR)
LR analysis uses the nine independent variables filtered through data screening.The AUC of the LR model of 0.776 indicates high prediction ability.In addition, the regression model is determined to be valid, because the Nagelkerke R-squared value, which describes the nificance level and reliability of the regression model, is 0.410, and the Hosmer and Lemeshow test significance probability is 0.317.Table 2 shows the regression coefficient of the above regression model and the influence of each independent variable on landside occurrence.Soil depth has the greatest influence (25.76%), followed by porosity and friction angle.( 13)

Structural equation model (SEM)
Figure 5 and Table 3 show the results of SEM analysis.The entire model system includes the internal and external models.The internal model, comprising physical properties, mechanical properties, topographic properties, and occurrence, is depicted by arrows pointing at occurrence, as each latent variable affects this outcome.The external model is shown by arrows pointing to the independent variables from each latent variable.The label on each arrow gives its statistical weight, which is a measure of how effectively the latent variable can explain the independent or other latent variables.The weight is the same as the effectiveness in Table 3.
According to path model theory, the effectiveness of each factor is the product of the weight in the external model and that in the internal model.For example, the effect of soil depth on occurrence is calculated as 0.945 × 0.579 in the external model and 0.547 in the internal model (Table 3).The most influential factor is soil depth; the next most influential factors in order are porosity, saturated unit weight, and slope angle.The   cohesion, profile curvature, and TWI have little effect on occurrence.
Reliability assessment of the entire model is based on the confidence level using p-values and the goodness of fit index (GFI).The statistical criterion evaluating the significance of the results at the 95% confidence level is considered here: p < 0.05 indicates satisfying the 95% confidence level.The GFI is calculated from the average communality and/or the geometric mean of the average determination coefficient.The criteria for high and low confidence in the GFI are the same as those used for R 2 , as the statistical meaning of GFI is similar to that of the determination coefficient.The criteria are as follows (Zikmund 2000;Moore et al. 2013;Sanchez 2013): • Low: R 2 < 0.3, • Moderate: 0.3 < R 2 < 0.6, • High: R 2 > 0.6.
The resulting p-value and GFI of the entire model are 0.000 and 0.763, respectively, which means that our results can be considered statistically significant with a "high" confidence grade.

XGBoost
Hyperparameter optimization, the most critical analysis process in machine learning, is first performed by grid searching.This involves selecting hyperparameter values that exhibit the highest performance by selecting hyperparameter candidate values at regular intervals.Hyperparameter selection uses verification in five layers (Table 4).
Log-loss is used to evaluate the performance of the training and test models.The learning performance and confusion matrix results are shown in Figs. 6 and  7, respectively, and Table 5 lists the predictive performance results for each model.The learning performance results for the prediction model show similar performance across most of the training models, but a 9:1 ratio of training to test data gives the best performance.For the test models, using an 8:2 ratio gives the best performance.Outstanding predictive performance is obtained, as indicated by the accuracy and AUC ranging from 60 to 90% and the difference in accuracy and AUC between the training and test models being < 20%.For precision and recall, there is a significant trade-off in the prediction model that uses all the test data.The precision and recall of the model using a 9:1 data ratio are both 100%, depending on the label value.Given the excessively small proportion of test data, the probability of predicting an actual "0" value as "0" or an actual "1" value as "1" is considered unreliable.Training-test data ratios of 8:2 and 7:3 minimize the trade-off.As a data ratio of 8:2 leads to a slightly higher performance than 7:3, it is considered optimal for a prediction model using XGBoost.
Table 6 lists the influence of each factor in the XGBoost 8:2 model.Soil depth has the most prominent influence, followed by friction angle, slope angle, plan curvature, and porosity.The saturated unit weight, cohesion, profile curvature, and TWI appear to have no significant influence in this model.

LightGBM
LightGBM applies grid searching for hyperparameter optimization, which uses verification in five layers (Table 7).
For LightGBM, log-loss is used to evaluate the performance of the training and test models.The learning performance and confusion matrix results are shown in Figs. 8 and 9, respectively, and Table 8 lists the predictive performance results for each model.The learning performance results for the prediction model show that the training model with a 9:1 training-to-test-data ratio performs best, similar to the learning performance results for XGBoost, and the test model with an 8:2 ratio performs best.However, the learning performance of    There is a significant trade-off in most of the test models among the recall, precision, and confusion matrix results.Among them, those with 8:2 and 5:5 ratios perform best.Given the high proportion of test data for the 5:5 model, a 8:2 data ratio is considered optimal for the prediction model.Table 9 lists the influence of each factor on the Light-GBM 8:2 model.Soil depth is the most influential, followed by friction angle, plan curvature, cohesion, porosity, slope, profile curvature, TWI, and saturated unit weight.However, the model depends markedly on the first two factors, which represent 82% of the total influence; the other factors have no significant influence (each < 7%).

Discussion: results and comparison of methods
Table 10 summarizes the influence of the selected factors for each analysis method.Soil depth is consistently the most influential (> 25%).The rankings of friction angle, slope angle, and porosity differ slightly among the analysis methods.Friction angle shows uniform influence (> 10%) across all the methods.Porosity substantially influences LR and SEM analyses but has minimal effect on machine learning.Among the machine learning methods, only XGBoost is substantially influenced by slope angle.Saturated unit weight, profile curvature, plan curvature, TWI, and cohesion generally have small (< 10%) influences.
Soil depth is the most influential factor because it relates directly to conditions that may cause landslides.The data in section "The theory of structural equation model" clearly show its correlation with landslide occurrence: soil depth is generally ≤ 2 m in areas with no landslide and ≥ 1 m in most areas where landslides have occurred.Friction and slope angles are highly influential, as they directly affect the driving and resistance forces of soil (Mehrotra et al. 1992;Budimir et al. 2015;Ҫellek 2020).Porosity is rarely considered significant when investigating the factors influencing landslides on natural slopes.However, when artificial compaction is performed, as on forest road slopes, porosity is highly influential because it represents the degree of compaction.Unit weight ranks fifth here for artificial slopes because porosity and unit weight are inversely proportional.This study finds topographic factors (profile curvature, plan curvature, and TWI) to be insignificantly influential because landslides around forest roads are more affected by the degree of compaction or resistance force than the concave or convex shape of the slope.Cohesion acts only on resistance force in stability analysis and significantly affects slope activities (Cousins 1978;Ahmadi-Adli et al. 2014;Lin et al. 2016).However, this study attributes insignificant influence to cohesion because the sandy (SP to SW) soil in the study area has low cohesion, and the calculation of cohesion significantly deviated as the Mohr-Coulomb failure criterion was applied within a small range (the median value of cohesion is higher for occurrence sites, whereas IQR and whisker are higher for non-occurrence sites; Fig. 4h).

Conclusions
Data for a mountainous area with forest roads were acquired through geological surveying, sampling, and laboratory testing, and the influence on landslide susceptibility of each measured parameter was analyzed using statistical and machine learning methods.The results are summarized as follows.
(1) The target area was Sancheok-myeon, Chungju-si, where rainfall of 259 mm on August 2, 2020 caused several landslides along the forest road in a narrow area of approximately 3 km × 4 km.As the area has the same rainfall and vegetation conditions, the influences of the physico-mechanical characteristics of the soil and topographic characteristics could be analyzed precisely.
(2) Geological surveying and sampling were conducted at 40 survey points where landslides occurred and 45 points where they did not.The soil's physicomechanical characteristics and topographic factors for each survey point were acquired.Only nine factors were subjected to statistical analysis and machine learning methods.(3) LR and SEM analysis results showed high accuracy, with values of 0.776 and 0.763, respectively.XGBoost and LightGBM exhibited outstanding performance in predicting landslides, with accuracy and AUC of 60%-90%, and differences of < 20% between the training and test data.(4) All analysis methods identified soil depth as having the greatest influence on landslide occurrence.Friction angle, slope angle, and porosity were also selected as influential factors, although they differed slightly in the rankings of the different analysis methods.
As the analysis results of this study are for an area across which rainfall and vegetation conditions are largely consistent, the influences of the soil's physicomechanical characteristics and the topography were analyzed more precisely than in studies comparing landslides across multiple regions.The results of this study are expected to be useful in the preparation of landslide vulnerability maps around forest roads.

Fig. 1
Fig. 1 Schematic diagram of the PLS-PM.a The external model comprises manifest variables (dependent and independent variables, X ij ) and latent variables.b The internal model consists of latent variables (LV i ).c The complete PLS-PM includes both the internal and external models

Fig. 2
Fig. 2 Locations of sampling points and photographs of landslides along the forest road.Landslides have occurred at 40 of the 85 sampling locations

Fig. 3 Fig. 4
Fig. 3 Schematic diagrams of a profile and b plan curvature (modified after Dikau 1989)

Fig. 5 Fig. 6 Fig. 7
Fig. 5 Results of SEM analysis, in which physical properties, mechanical properties, and topographic characteristics affect landslide occurrence.Numbers near each arrow in the external and internal models indicate absolute values of the statistical weight, which quantifies that factor's effect on occurrence

Fig. 8 Fig. 9
Fig. 8 Results of learning performance for LightGBM prediction models using different data ratios

Table 1
Variation inflation factors (VIFs) for properties influencing landslide susceptibility to assess multicollinearity.Although there are 12 factors in the first step, only nine factors are left after multicollinearity check (three factors are eliminated)

Table 2
Results of logistic regression analysis

Table 3
Results of quantified influence of factors on landslides.The total influence is calculated by multiplying the influence in the external model by that in the internal model

Table 5
Results of XGBoost prediction for different ratios of training and test data

Table 6
Each factor's influence in the XGBoost 8:2 model

Table 7
Results of hyperparameter optimization for LightGBM

Table 8
Results of LightGBM prediction for different training and test data ratios

Table 9
Influence of each factor in the LightGBM 8:2 model

Table 10
Summary of influence and rank for analysis methods