Hierarchical clustering of mixed variable panel data based on new distance


Akay Ö., Yüksel G.

COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2019 (SCI-Expanded)

Abstract

One of the important aspects of panel data is the poolability of the different units in the data set. Pooling the units is appropriate only when the regression parameters (coefficients) are homogeneous across units. When this is in doubt, clustering the units of the panel data can be used to address the problem. In this study, we propose a new distance for a mixed-variable panel data set containing a time-invariant binary variable, without converting variables, in order to avoid information loss. In this approach, the mixed-variable panel data set is divided into a purely categorical data set and a purely numerical data set. Distances are then calculated using simple matching for the categorical data set and using a distance that normalizes all variables for the numerical data set. After the distance measures of the two data sets are combined, the new distance measure is integrated into agglomerative hierarchical clustering algorithms. The approach is illustrated on real data sets using the STATA and R software packages. The performance of the proposed distance is compared with that of the Gower and K-prototype distances using cluster validation methods. The experimental results show that the new approach provides better clustering results than the Gower and K-prototype approaches.
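The pipeline described in the abstract (split the variables, compute a distance on each part, combine, then cluster agglomeratively) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' distance: the abstract does not give the formula, so the range normalization, the equal-weight combination (`weight=0.5`), the average-linkage rule, the function names, and the toy data below are all hypothetical choices made for the example.

```python
from itertools import combinations


def simple_matching(a, b):
    """Share of categorical attributes on which two units disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def normalized_numeric(a, b, ranges):
    """Mean range-scaled absolute difference over all periods and variables.

    a, b: one unit each, given as a list of periods, each a list of values.
    The range scaling is an assumption; the abstract only says 'normalizes'.
    """
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y, r in zip(row_a, row_b, ranges):
            total += abs(x - y) / r if r else 0.0
            count += 1
    return total / count


def mixed_distance_matrix(numeric, categorical, weight=0.5):
    """Combine both parts; the weighted sum is an assumed combination rule."""
    n = len(numeric)
    n_vars = len(numeric[0][0])
    # Range of each numeric variable over all units and periods.
    ranges = []
    for j in range(n_vars):
        values = [row[j] for unit in numeric for row in unit]
        ranges.append(max(values) - min(values))
    d = [[0.0] * n for _ in range(n)]
    for i, k in combinations(range(n), 2):
        cat_part = simple_matching(categorical[i], categorical[k])
        num_part = normalized_numeric(numeric[i], numeric[k], ranges)
        d[i][k] = d[k][i] = weight * cat_part + (1 - weight) * num_part
    return d


def average_linkage(d, n_clusters):
    """Naive agglomerative clustering on a precomputed distance matrix."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = len(clusters[a]) * len(clusters[b])
            link = sum(d[i][j] for i in clusters[a] for j in clusters[b]) / pairs
            if best is None or link < best[0]:
                best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy panel: 4 units, 2 periods, 2 numeric variables, 1 time-invariant binary.
numeric = [
    [[1.0, 10.0], [1.2, 11.0]],
    [[1.1, 10.5], [1.3, 11.5]],
    [[5.0, 50.0], [5.5, 52.0]],
    [[5.2, 49.0], [5.4, 51.0]],
]
categorical = [[0], [0], [1], [1]]

dist = mixed_distance_matrix(numeric, categorical)
clusters = average_linkage(dist, n_clusters=2)
print(clusters)  # [[0, 1], [2, 3]]
```

Because the clustering step only needs a precomputed distance matrix, the same matrix could in principle be passed to an existing hierarchical clustering routine instead of the naive loop used here.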