Head/tail Breaks

Head/tail breaks is a clustering algorithm scheme for data with a heavy-tailed distribution such as power laws and lognormal distributions. The heavy-tailed distribution can be simply referred to the scaling pattern of far more small things than large ones, or alternatively numerous smallest, a very few largest, and some in between the smallest and largest. The classification is done through dividing things into large (or called the head) and small (or called the tail) things around the arithmetic mean or average, and then recursively going on for the division process for the large things or the head until the notion of far more small things than large ones is no longer valid, or with more or less similar things left only.[1] Head/tail breaks is not just for classification, but also for visualization of big data by keeping the head, since the head is self-similar to the whole. Head/tail breaks can be applied not only to vector data such as points, lines and polygons, but also to raster data like digital elevation model (DEM).

1024 cities that follow exactly Zipf's law, which implies that the first largest city is size 1, the second largest city is size 1/2, the third largest city is size 1/3, ... and the smallest city is size 1/1024. The left pattern is produced by head/tail breaks, while the right one by natural breaks, also known as Jenks natural breaks optimization.

Motivation

The head/tail breaks is mainly motivated by inability of conventional classification methods such as equal intervals, quantiles, geometric progressions, standard deviation, and natural breaks - commonly known as Jenks natural breaks optimization or k-means clustering for revealing the underlying scaling pattern of far more small things than large ones. Note that the notion of far more small things than large one is not only referred to geometric property, but also to topological and semantic properties. In this connection, the notion should be interpreted as far more unpopular (or less-connected) things than popular (or well-connected) ones, or far more meaningless things than meaningful ones. Head/tail breaks uses the mean or average as a method to separate between smaller and larger things. Head/tail breaks do not characterize with average values, like with the previously mentioned classical methods (k-means clustering, natural breaks), but uses them only to differentiate. This allows scale-free things, like fractals, to be analysed regardless of lacking meaningful averages. This opens new avenues of analyzing data besides the widely known methods which are characterizing data through their mean values.

Method

Given some variable X that demonstrates a heavy-tailed distribution, there are far more small x than large ones. Take the average of all xi, and obtain the first mean m1. Then calculate the second mean for those xi greater than m1, and obtain m2. In the same recursive way, we can get m3 depending on whether the ending condition of no longer far more small x than large ones is met. For simplicity, we assume there are three means, m1, m2, and m3. This classification leads to four classes: [minimum, m1], (m1, m2], (m2, m3], (m3, maximum]. In general, it can be represented as a recursive function as follows:

    Recursive function Head/tail Breaks:
    Rank the input data values from the biggest to the smallest;
    Compute the mean value of the data
    Break the data (around the mean) into the head and the tail;  
    // the head for data values greater the mean
    // the tail for data values less the mean
    while (length(head)/length(data) <=40%):
        Head/tail Breaks(head);
    End Function

The resulting number of classes is referred to as ht-index, an alternative index to fractal dimension for characterizing complexity of fractals or geographic features: the higher the ht-index, the more complex the fractals.[2]

Threshold or its sensitivity

The criterion to stop the iterative classification process using the head/tail breaks method is that the remaining data (i.e., the head part) are not heavy-tailed, or simply, the head part is no longer a minority (i.e., the proportion of the head part is no longer less than a threshold such as 40%). This threshold is suggested to be 40% by Jiang et al. (2013),[3] just as the codes above (i.e., (length/head)/length(data) ≤ 40%). This process is called head/tail breaks 1.0. But sometimes a larger threshold, for example 50% or more, can be used, as Jiang and Yin (2014)[2] noted in another article: "this condition can be relaxed for many geographic features, such as 50 percent or even more". However, all heads' percentage on average must be smaller than 40% (or 41, 42%), indicating far more small things than large ones. Many real-world data cannot be fit into a perfect long tailed distribution, therefore its threshold can be relaxed structurally. In head/tail breaks 2.0 the threshold only applies to the overall heads' percentage.[4] This means that the percentages of all heads related to the tails should be around 40% on average. Individual classes can have any percentage spit around the average, as long as this averages out as a whole. For example, if there is data distributed in such a way that it has a clearly defined head and tail during the first and second iteration (length(head)/(length(data)<20%) but a much less well defined long tailed distribution for the third iteration (60% in the head), head/tail breaks 2.0 allows the iteration to continue into the fourth iteration which can be distributed 30% head - 70% tail again and so on. As long as the overall threshold is not surpassed the head/tail breaks classification holds.

Rank-size plot and RA index

A good tool to display the scaling pattern, or the heavy-tailed distribution, is the rank-size plot, which is a scatter plot to display a set of values according to their ranks. With this tool, a new index [5] termed as the ratio of areas (RA) in a rank-size plot was defined to characterize the scaling pattern. The RA index has been successfully used in the estimation of traffic conditions. However, the RA index can only be used as a complementary method to the ht-index, because it is ineffective to capture the scaling structure of geographic features.

Other Indices based on the head/tail breaks

In addition to the ht-index, the following indices are also derived with the head/tail breaks.

  • CRG-index. It is developed as a more sensitive ht-index to capture the slight changes of geographic features.[6] In contrast to the ht-index, which is an integer, CRG-index is a real number.
  • Unified metrics. Two unified metrics (UM1 and UM2) were proposed in an AAAG paper [7] for characterizing the fractal nature of geographic features. One can be used to answer the question of “I know there are far more small things than large ones, but how small (or large) are these small (or large) things?”, whereas the other one to answer “I know there are far more small things than large ones, but how many more?”
  • Fht-index: It is a fractional ht-index, which is able to capture fractional hierarchy.[8] The fht-index might be of help for creating an intermediate scale between two consecutive map scales, leading to so called continuous map scales.

Applications

Instead of more or less similar things, there are far more small things than large ones surrounding us. Given the ubiquity of the scaling pattern, head/tail breaks is found to be of use to statistical mapping, map generalization, cognitive mapping and even perception of beauty .[3][9][10] It helps visualize big data, since big data are likely to show the scaling property of far more small things than large ones. The visualization strategy is to recursively drop out the tail parts until the head parts are clear or visible enough.[11] In addition, it helps delineate cities or natural cities to be more precise from various geographic information such as street networks, social media geolocation data, and nighttime images.

Characterizing the imbalance

As the head/tail breaks method can be used iteratively to obtain head parts of a data set, this method actually captures the underlying hierarchy of the data set. For example, if we divide the array (19, 8, 7, 6, 2, 1, 1, 1, 0) with the head/tail breaks method, we can get two head parts, i.e., the first head part (19, 8, 7, 6) and the second head part (19). These two head parts as well as the original array form a three-level hierarchy:

the 1st level (19),

the 2nd level (19, 8, 7, 6), and

the 3rd level (19, 8, 7, 6, 2, 1, 1, 1, 0).

The number of levels of the above-mentioned hierarchy is actually a characterization of the imbalance of the example array, and this number of levels has been termed as the ht-index.[2] With the ht-index, we are able to compare degrees of imbalance of two data sets. For example, the ht-index of the example array (19, 8, 7, 6, 2, 1, 1, 1, 0) is 3, and the ht-index of another array (19, 8, 8, 8, 8, 8, 8, 8, 8) is 2. Therefore, the degree of imbalance of the former array is higher than that of the latter array.

The left panel pattern contains 50,000 natural cities, which can be put into 7 hierarchical levels. It looks like a hair ball. Instead of showing all the 7 hierarchical levels, we show 4 top levels, by dropping out 3 low levels. Now with the right panel, the scaling pattern of far more small cities than large ones emerges. It is important to note that the right pattern (or the remaining part after dropping out the tails) is self-similar to the whole (or the left pattern). Thus the right pattern reflects the underlying structure of the left one, and enables us to see the whole.
The scaling pattern of US terrain surface is distorted by the natural breaks, but revealed by the head/tail breaks.

Delineating natural cities

The term ‘natural cities’ refers to the human settlements or human activities in general on Earth's surface that are naturally or objectively defined and delineated from massive geographic information based on head/tail division rule, a non-recursive form of head/tail breaks.[12][13] Such geographic information could be from various sources, such as massive street junctions [13] and street ends, a massive number of street blocks, nighttime imagery and social media users’ locations etc. Distinctive from conventional cities, the adjective ‘natural’ could be explained not only by the sources of natural cities, but also by the approach to derive them. Natural cities are derived from a meaningful cutoff averaged from a massive number of units extracted from geographic information.[11] Those units vary according to different kinds of geographic information, for example the units could be area units for the street blocks and pixel values for the nighttime images. A natural cities model has been created using ArcGIS model builder,[14] it follows the same process of deriving natural cities from location-based social media,[12] namely, building up huge triangular irregular network (TIN) based on the point features (street nodes in this case) and regarding the triangles which are smaller than a mean value as the natural cities. These natural cities can also be created from other open access information like OpenStreetMap and further be used as an alternative delineation of administrative boundaries. Scaling law can also at the same time correctly be identified and the administrative borders can be created to respect this by the delineation of the natural cities.[15][16]

Color rendering DEM

Current color renderings for DEM or density map are essentially based on conventional classifications such as natural breaks or equal intervals, so they disproportionately exaggerate high elevations or high densities. As a matter of fact, there are not so many high elevations or high-density locations.[17] It was found that coloring based head/tail breaks is more favorable than those by other classifications [18]

Further applications

Other applications of Head/tail breaks:

  • Serving as a method for efficiently estimating the absolute bolzman entropy of numerical raster data[19]
  • Quantifying the multiscale representation of a polyline[20]
  • Increasing computational efficiency in flow data analysis by emphasising the head part of the flow dataset[21]
  • Temporal analysis of urban expansion related to the thermal environment[22]
  • Image analysis where anisotropy is measured in point patterns extracted with a digital pulse transform with the use of head/tail breaks[23]
  • Visualizing and analyzing spatial patterns in bilateral trade[24]
  • To identify urban function graphs,[25] note that this paper applies head/tail breaks on a gaussian kernel density estimation which reduces the accuracy of the head/tail breaks method. Essentially a natural cities approach is taken but the initial data chosen to perform head/tail breaks on has been reduced beforehand. For a better representation of urban function graphs head/tail breaks may be applied as the first step in delineating these areas.
  • Analyzing structures or hotspots naturally occurring within data to highlight areas of interest (Based on natural cities).
    • (Over)Tourism analysis based on short term rentals (like AirBnB) by creating hotspots out of the distribution of rented out apartments.[26]
  • Head/tail breaks can serve as a main indicator that phenomena are distributed long tailed and that paretian thinking should favor gaussian thinking in geographic spaces. For example within biodiversity and pedodiversity studies where there seem to be fractal relationships such as taxa-area relationships[27]. Complementary to this the polygons of soil and vegetation maps also show scaling within their structures. This can be identified and highlighted by using head/tail breaks[28].

Software implementations

The following implementations are available under Free/Open Source Software licenses.

  • HT calculator: a winform application for obtaining related metrics of head/tail breaks applying on a single data array.
  • HT in JavaScript: a JavaScript implementation for applying head/tail breaks on a single data array.
  • HT Mapping tool: a function in the free plug-in Axwoman 6.3 to ArcMap 10.2 that conducts geo-data symbolization automatically based on the head/tail breaks classification.
  • HT in Python: Python and JavaScript code for the head/tail breaks algorithm. It works great for choropleth map coloring.
  • pysal.esda.mapclassify: Python classification schemes for choropleth mapping, including head/tail breaks map classification.
  • smoomapy 0.1.9: Brings smoothed maps through python.
  • Ht-index calculator: A PostgreSQL function for calculating ht-index (also see [29]).
  • RA calculator: Software for calculating the ratio of areas (RA) in a rank-size plot (also see [5]).
gollark: LyricLy is a total nonagon!
gollark: Is that evil laughter?
gollark: What are you *saying*?
gollark: I totally can.
gollark: I am running a simulation of the universe on osmarks.tk server #2 to predict what you'll say.

References

  1. Jiang, Bin (2013). "Head/tail breaks: A new classification scheme for data with a heavy-tailed distribution", The Professional Geographer, 65 (3), 482 – 494.
  2. Jiang, Bin and Yin Junjun (2014). "Ht-index for quantifying the fractal or scaling structure of geographic features", Annals of the Association of American Geographers, 104(3), 530–541.
  3. Jiang, Bin, Liu, Xintao and Jia, Tao (2013). "Scaling of geographic space as a universal rule for map generalization", Annals of the Association of American Geographers, 103(4), 844 – 855.
  4. Jiang, B. (2019). A recursive definition of goodness of space for bridging the concepts of space and place for sustainability. Sustainability, 11(15), 4091. https://arxiv.org/ftp/arxiv/papers/1909/1909.01073.pdf
  5. Gao, Peichao; Liu, Zhao; Tian, Kun; Liu, Gang (2016-03-10). "Characterizing Traffic Conditions from the Perspective of Spatial-Temporal Heterogeneity". ISPRS International Journal of Geo-Information. 5 (3): 34. Bibcode:2016IJGI....5...34G. doi:10.3390/ijgi5030034.
  6. Gao, Peichao; Liu, Zhao; Xie, Meihui; Tian, Kun; Liu, Gang (2016-10-01). "CRG Index: A More Sensitive Ht-Index for Enabling Dynamic Views of Geographic Features". The Professional Geographer. 68 (4): 533–545. doi:10.1080/00330124.2015.1099448. ISSN 0033-0124.
  7. Gao, Peichao; Liu, Zhao; Liu, Gang; Zhao, Hongrui; Xie, Xiaoxiao (2017-06-02). "Unified Metrics for Characterizing the Fractal Nature of Geographic Features". Annals of the American Association of Geographers. 0 (6): 1315–1331. doi:10.1080/24694452.2017.1310022. ISSN 2469-4452.
  8. Jiang, Bin; Ma, Ding (2017). "How complex is a fractal? Head/tail breaks and fractional hierarchy". Journal of Geovisualization and Spatial Analysis. 2: xx–xx. Preprint. arXiv:1703.00814. doi:10.1007/s41651-017-0009-z.
  9. Jiang, Bin (2013b). "The image of the city out of the underlying scaling of city artifacts or locations", Annals of the Association of American Geographers, 103(6), 1552-1566.
  10. Jiang, Bin and Sui, Daniel (2014). "A new kind of beauty out of the underlying scaling of geographic space", The Professional Geographer, 66(4), 676–686
  11. Jiang, Bin (2015). "Head/tail breaks for visualization of city structure and dynamics", Cities, 43, 69 - 77.
  12. Jiang, Bin and Miao, Yufan (2015). "The evolution of natural cities from the perspective of location-based social media", The Professional Geographer, 67(2), 295 - 306.
  13. Long, Ying (2016). "Redefining Chinese city system with emerging new data", Applied Geography, 75, 36 - 48.
  14. Ren, Zheng (2016). "Natural cities model in ArcGIS", http://www.arcgis.com/home/item.html?id=47b1d6fdd1984a6fae916af389cdc57d.
  15. Alvioli, Massimiliano (2020). "Comparative study of delineation of urban areas using imperviousness products and open data". Geomorphometry 2020. 1: 1–4. doi:10.30437/GEOMORPHOMETRY2020_1.
  16. Alvioli, Massimiliano (2020-12-01). "Administrative boundaries and urban areas in Italy: A perspective from scaling laws". Landscape and Urban Planning. 204: 103906. doi:10.1016/j.landurbplan.2020.103906. ISSN 0169-2046.
  17. Jiang, Bin (2015). "Geospatial analysis requires a different way of thinking: The problem of spatial heterogeneity", GeoJournal, 80(1), 1-13.
  18. Wu, Jou-Hsuan (2015). "Examining the new kind of beauty using the human being as a measuring instrument", http://www.diva-portal.org/smash/get/diva2:805296/FULLTEXT01.pdf.
  19. Zhang, Hong; Wu, Zhiwei (February 2020). "A Head/Tail Breaks-Based Method for Efficiently Estimating the Absolute Boltzmann Entropy of Numerical Raster Data". ISPRS International Journal of Geo-Information. 9 (2): 103. Bibcode:2020IJGI....9..103Z. doi:10.3390/ijgi9020103.
  20. Liu, Pengcheng; Xiao, Tianyuan; Xiao, Jia; Ai, Tinghua (2020-04-22). "A multi-scale representation model of polyline based on head/tail breaks". International Journal of Geographical Information Science. 0: 1–21. doi:10.1080/13658816.2020.1753203. ISSN 1365-8816.
  21. Tao, Ran; Gong, Zhaoya; Ma, Qiwei; Thill, Jean-Claude (May 2020). "Boosting Computational Effectiveness in Big Spatial Flow Data Analysis with Intelligent Data Reduction". ISPRS International Journal of Geo-Information. 9 (5): 299. doi:10.3390/ijgi9050299.
  22. Yang, Zhiwei; Chen, Yingbiao; Wu, Zhifeng; Qian, Qinglan; Zheng, Zihao; Huang, Qingyao (2019-09-01). "Spatial heterogeneity of the thermal environment based on the urban expansion of natural cities using open data in Guangzhou, China". Ecological Indicators. 104: 524–534. doi:10.1016/j.ecolind.2019.05.032. ISSN 1470-160X.
  23. Fabris-Rotelli, I.; Stein, A. (2020-05-26). "Use of fractals to measure anisotropy in point patterns extracted with the DPT of an image". Spatial Statistics: 100452. doi:10.1016/j.spasta.2020.100452. ISSN 2211-6753.
  24. Ye, Sijing; Song, Changqing; Cheng, Changxiu; Shen, Shi; Gao, Peichao; Zhang, Ting; Chen, Xiaoqiang; Wang, Yuanhui; Wan, Changjun (June 2020). "Digital Trade Feature Map: A New Method for Visualization and Analysis of Spatial Patterns in Bilateral Trade". ISPRS International Journal of Geo-Information. 9 (6): 363. doi:10.3390/ijgi9060363.
  25. Chen, Yimin; Chen, Xinyue; Liu, Zihui; Li, Xia (2020-02-01). "Understanding the spatial organization of urban functions based on co-location patterns mining: A comparative analysis for 25 Chinese cities". Cities. 97: 102563. doi:10.1016/j.cities.2019.102563. ISSN 0264-2751.
  26. Celata, Filippo; Romano, Antonello (2020-07-07). "Overtourism and online short-term rental platforms in Italian cities". Journal of Sustainable Tourism. 0 (0): 1–20. doi:10.1080/09669582.2020.1788568. ISSN 0966-9582.
  27. Ibáñez, J. J.; Ramírez‐Rosario, B.; Fernández‐Pozo, L. F.; Brevik, E. C. "Exploring Scaling Law of Geographical Space: Gaussian versus Paretian thinking". European Journal of Soil Science. n/a (n/a). doi:10.1111/ejss.13031. ISSN 1365-2389.
  28. Ibáñez, J. J.; Ramírez‐Rosario, B.; Fernández‐Pozo, L. F.; Brevik, E. C. "Land System Diversity, Scaling Laws, and Polygons Map Analysis". European Journal of Soil Science. n/a (n/a). doi:10.1111/ejss.13035. ISSN 1365-2389.
  29. Tian, Kun; Peichao Gao (2015). "A PostgreSQL function for calculating the ht-index (PDF Download Available)". ResearchGate. doi:10.13140/rg.2.1.3041.0324. Retrieved 2017-08-08.

Further reading

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.