How To Create A Histogram - PowerPoint PPT Presentation

About This Presentation
Title:

How To Create A Histogram

Description:

When you want to provide information about the distribution of data - more specifically, the density of the underlying distribution of numerical data - a histogram may be the way to go. – PowerPoint PPT presentation

Number of Views:170

less

Transcript and Presenter's Notes

Title: How To Create A Histogram


1
HOW TO CREATE A HISTOGRAM
2
The adage "a picture is worth a thousand words"
is certainly true when it comes to visually
representing data. Means and medians don't
provide as much information as a graph, chart or
histogram, and can even be misleading. When you
want to provide information about the
distribution of data more specifically, the
density of the underlying distribution of
numerical data a histogram may be the way to
go. What are histograms? Histograms are
graphical displays of data using bars of
different heights, called "bins," whose widths
represent ranges of data points. The bars show
the underlying frequency distribution of the
data. If histograms better communicate
information than averages or medians, then why
aren't they used more often? One reason is
simply that creating them takes a little more
time than calculating averages or medians,
especially if you take the time to do it
correctly. Histograms can be very useful at
showing the distribution of data, but if the
number of bins (or bars) is incorrect, the data
can easily be misrepresented, and patterns won't
emerge. Did you know there are methods to
calculate the number of bins and the bin width
based on the data set itself? If these methods
aren't used, the histogram will tell a very
different story. To illustrate the process of
creating a histogram, I pulled Speed Index scores
from HTTP Archive for the top 100 web sites for
April 15, 2016. After filtering out results with
no value, and three highest that had a value of
61,000 (which appears to be timeouts), I have a
data set of 94 data points ranging from a Speed
Index value of 764 to 49,600. It takes almost no
time in a spreadsheet to calculate the average of
5,065 and the median of 2,976, but to summarize
such a diverse set of data this way doesn't tell
the whole story. I want to know if there are
patterns and if the data has a normal
distribution.
3
Creating Histograms with Excel If you are like me
and don't have a fancy statistics and charting
program, you would probably turn to either
Excel, Google Sheets or some other spreadsheet
program to create charts. When using Excel, you
first need to install the Analysis ToolPak
add-on. This tool pak provides the ability to
create histograms. Creating a histogram this
way is a very manual process and no guidance is
provided in terms of proper bin sizes you are
left to guess what the best bin size is. I
decided to randomly create bins with a fixed
width of 2500. This breaks the data into 13
distinct bins, which seems like a good number.
The resulting histogram has a long tail and is
relatively skewed. With this data set the width
of the bins need to be smaller to see more
relevant patterns. Some of you may be looking at
this chart and say, "that's not a histogram,
that's a bar chart. You are correct. There are
numerous differences between histograms and bar
charts, while both have columns using them
interchangeably is incorrect. A histogram
represents binned quantitative data and should
have no spaces or gaps unless a given
distribution has a value of 0. Aside from being
a manual process, a bigger problem with using
Excel to create histograms is you can't actually
create a histogram. You can only create bar
charts. For this reason I suggest using a tool
other than Excel to create histograms. I will
switch over to Google Sheets  for the remainder
of this post. I suspect that people making
histograms with Excel use the method of randomly
choosing bin sizes. The problem is, with a large
data set and range of results, this histogram
doesn't tell us much when almost half of the
results fall into the first bin. To create an
accurate histogram, there are two different
formulas I turn to, depending on the data.
4
Square Root Method An easy way to determine the
number of bins needed is to calculate the square
root of the number of data points. For this data
set there are 94 data points. The square root of
94 is 9.69 which I round up to 10. Next step is
to determine the width of the bins. The Speed
Index scores range from 764 to 49,600. The
easiest way with this data set is to divide the
range of values by the number of bins, giving us
a bin width of 626. (49,600-764)/104883.6 I
like to have histogram bins in nice round
numbers, so I round up to 5000. The distribution
now looks like this If you are using Google
Sheets, you don't actually have to go through any
of these calculations as this is the method used
to create their default histogram chart. This
histogram is actually worse than our Excel bar
chart. If you recall, I mentioned that the bins
needed to be smaller using this method the bins
are larger. For this dataset, this formula
doesn't work, so let's move on to another method.
5
Freedman-Diaconis rule The Freedman-Diaconis rule
is another method to create histograms. This
method calculates the width of the bin and
doesn't attempt to limit the number of bins. This
works well on datasets where there is a large
range between the minimum and maximum values, but
where data is skewed towards one end. It is a
little more complicated to calculate, but much
more practical and I think it is worth the
time. Bin Width 2 (IQR(x))/n? This may look
scary, especially if you're not a math person,
but it is not so bad and relatively easy to do in
a spreadsheet. First let's translate the formula
into English. IQR stands for interquartile
range, also known as the "middle fifty" of a data
set. Quartiles are the three points that divide a
data set into four equal groups, each group
comprising a quarter of the data. Basically, it's
the 25, 50 and 75 marks, where the 25 mark is
the first quartile, the 50 mark is the median,
and the 75 is the third quartile. This indicates
where the bulk of the values are. IQR(x)indicates
we need to calculate the interquartile range for
the values in the data set. n? or n? stands for
the cube root, where n is the total number of
data points. A cube root is a number that, when
multiplied by itself three times, produces the
number. For example, 3 x 3 x 3 27, so the cube
root of 27 is 3. Combining these terms, to
calculate the width of the bins, divide the
inter-quartile range of the data by the cube root
of the number of data points and then multiple by
two.
6
First, to calculate the IQR, the 1st and 3rd
quartiles have to be calculated. This is very
easy to do in Excel or Google Sheets, as both
provide a QUARTILE formula QUARTILE(A2A95,1)
where cell A2 is the location of the
first number in your data set, cell A95 is the
data cell containing the last number in the data
set, and 1 indicates the quartile. To calculate
the third quartile, simply replace 1 with 3. Once
these numbers are obtained, subtract Q1 from
Q3. For my dataset quartile 1 is 1604 and
quartile 3 is 5346.5. To calculate
IQR 5346.5-16043742.5 The formula now reads 2
(3742.5/n?). The next item to calculate is n?.
I again turn to my spreadsheet. N stands for the
number of data points in the data set in this
case it is 94. I enter the following
formula 94(?)
7
  • resulting in a rounded value of 4.5. Inserting
    this into the formula
  • 2(3742.5/4.5)
  • Following the order of operations PEMDAS, we
    first solve the problem within the parentheses
  • 3742.5/4.5831.66
  • Updating the equation
  • 2831.661663.33
  • I round down to 1500.This gives us 33 bins with a
    width of 1500 and the following histogram

8
  • Now we are able to see some patterns, although
    this chart has a number of outliers,
  • and some bins have 0 results. At this point you
    might consider trimming some of the
  • outliers, by focusing on only sites with scores
    in the 97th percentile.
  • Use the following formula to determine the 97th
    percentile
  • PERCENTILE(A2A95,0.97)
  • where A2 is the cell of the first data point, A95
    is the cell for the last data point,
  • and 0.97 is the percentile. For our data set the
    97th percentile is 24,559. This reduces
  • our data set to 91, which is still a reasonably
    sized data set. Using the
  • Freedman-Diaconis rule our new equation is
  • 2(3460/4.5)1537.8
  • In this case our bin size when rounded is the
    same, the difference is the maximum
  • bin will be 25,500 and not 49,500, reducing the
    length of the tail.
  • The resulting histogram looks like this

Taking a few minutes to create a histogram when
dealing with data that has a long tail can
provide more insights than presenting an average
or median. Histograms and other diagram types
help people understand data quickly, enabling
your audience to "see" what you are talking about
well beyond the insight a single number, like an
average, can provide.
9
Instart Logic is the worlds first endpoint-aware
application delivery solution that makes websites
and applications fast, secure, and easy to
operate.
Interested in learning more? Preview our image
optimization capabilities in the Playground
Contact Sales
Write a Comment
User Comments (0)
About PowerShow.com