Wednesday, February 19, 2014

Multivariate data displays are not always necessary


Over the past two years I have published a number of posts in which I have used a data-display network as a multivariate data summary, comparable to an ordination (e.g. PCA) or a cluster analysis (e.g. UPGMA). This is a form of exploratory data analysis.

Here, I wish to point out that a multivariate data summary is not always necessary, even when the data are multivariate in form.

As an example, I will use the official census data on retail book sales in the USA. The monthly data are provided by the United States Census Bureau for the years 1992-2013 at:
 http://www.census.gov/retail/mrts/www/data/excel/mrtssales92-present.xls.
The data include census code 4512, which covers "Book Stores, General", "Specialty Book Stores" and "College Book Stores". The data notes say: "Estimates are shown in millions of dollars, and are based on data from the Monthly Retail Trade Survey, Annual Retail Trade Survey, and administrative records." I downloaded the data on 17 February 2014.

These data are multivariate. For example, if each year is taken as a sample object, then there are data for 12 variables for each sample (one for each month). Any multivariate data analysis can therefore be applied to this dataset.
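To make the "years as samples, months as variables" layout concrete, here is a minimal Python sketch. It assumes the monthly figures have already been read into one chronological list covering complete years; the function name `to_year_rows` and the numbers are illustrative placeholders, not the Census Bureau values.

```python
def to_year_rows(first_year, monthly_values):
    """Split a chronological list of monthly values into {year: [12 values]}."""
    if len(monthly_values) % 12 != 0:
        raise ValueError("expected complete years (a multiple of 12 values)")
    return {
        first_year + i: monthly_values[12 * i : 12 * i + 12]
        for i in range(len(monthly_values) // 12)
    }


# Placeholder numbers only: two "years" holding the values 1..24.
flat = list(range(1, 25))
rows = to_year_rows(1992, flat)
```

Each key of `rows` is then one sample object, and its list holds the 12 monthly variables for that year.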

In the usual manner, I have used the Manhattan distance and a neighbor-net network. Years that are closely connected in the network are similar to each other based on the 12 monthly sales figures, and those that are further apart are progressively more different from each other.
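For readers unfamiliar with it, the Manhattan distance between two years is simply the sum of the absolute differences between their monthly figures. The sketch below shows that calculation for every pair of years. The profiles are made-up and shortened to three "months" for brevity, and the helper names are hypothetical. The neighbor-net step itself is not shown; the resulting distances would be handed to whichever network program is in use.

```python
from itertools import combinations


def manhattan(a, b):
    """Sum of absolute differences between two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of variables")
    return sum(abs(x - y) for x, y in zip(a, b))


def pairwise_distances(profiles):
    """Return {(year_a, year_b): distance} for every pair of years."""
    return {
        (p, q): manhattan(profiles[p], profiles[q])
        for p, q in combinations(sorted(profiles), 2)
    }


# Made-up profiles, not the Census Bureau figures.
toy = {1992: [10, 12, 11], 1993: [11, 13, 12], 2007: [20, 25, 22]}
distances = pairwise_distances(toy)
```

In this toy example the two adjacent years end up much closer to each other than either is to the third, which is the kind of structure the network then displays.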


However, all that the network shows is a gradient running clockwise from the top: sales rose from 1992, reached a peak in 2007, and then declined again. In other words, the data form a simple time series, and all that is actually needed is to plot them that way.

So, this same pattern could be displayed more simply by graphing the yearly averages, as shown in the next graph. A network is complete overkill in this case. I presume that the recent decrease in retail book sales has something to do with the rise of e-book sales.
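The yearly averages need nothing more than a mean per year. A minimal sketch follows, assuming the data are held as a dictionary of monthly values keyed by year; the numbers are invented for illustration and the name `yearly_averages` is hypothetical.

```python
from statistics import mean


def yearly_averages(profiles):
    """Return {year: mean of that year's monthly values}, in year order."""
    return {year: mean(values) for year, values in sorted(profiles.items())}


# Invented values, shortened to three "months" per year.
toy = {1992: [10, 12, 14], 1993: [20, 22, 24], 1994: [15, 15, 18]}
averages = yearly_averages(toy)

# The year with the highest average marks the peak of the time series.
peak_year = max(averages, key=averages.get)
```

Plotting `averages` against the year, with any graphing tool, gives the simple time series described above.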


We could also plot the monthly sales, while we are at it. The peaks in late summer and at Christmas are very distinct. Presumably people are buying books to read in summer, and to give away at Christmas.
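One way to make the seasonal peaks explicit, which is not necessarily how the graph was produced, is to average each calendar month across all of the years. The sketch below assumes the same year-keyed layout of monthly values; the numbers are invented and the name `monthly_averages` is hypothetical.

```python
from statistics import mean


def monthly_averages(profiles):
    """Average each month position across all years (column means)."""
    return [mean(month) for month in zip(*profiles.values())]


# Invented values, shortened to three "months" per year.
toy = {1992: [10, 30, 20], 1993: [12, 34, 22]}
seasonal = monthly_averages(toy)

# Zero-based position of the month with the highest average sales.
peak_position = seasonal.index(max(seasonal))
```

With the full 12-month profiles, the positions of the largest values in `seasonal` would correspond to the peak months.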


Finally, note that not all time series can be plotted in a simple manner. If the time patterns are complex, then a multivariate analysis, such as a network, will probably be of some use as a data display.
