Summary of ways to draw histograms in Python¶

This note draws histograms with the following Python libraries.

pandas
matplotlib
seaborn
plotly, plotly-express
bokeh

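The following fine-grained and frequently used settings are also covered where possible.

Bin settings
Histograms per label
Showing data from different columns together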

Execution environment¶

macOS Catalina
pipenv

!python --version

Python 3.7.4

!pip list | grep -E '^(pandas|numpy|scikit-learn|matplotlib|seaborn|plotly|plotly-express|bokeh) '

bokeh              2.0.1       
matplotlib         3.2.1       
numpy              1.18.2      
pandas             1.0.3       
plotly             4.6.0       
plotly-express     0.4.1       
scikit-learn       0.22.2.post1
seaborn            0.10.0

Import the packages¶

import os, sys

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import bokeh

%matplotlib inline

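Prepare the data¶

The data from scikit-learn's load_iris() is stored in a DataFrame.

For details of the dataset, see 7.2.2. Iris plants dataset.

df: a DataFrame with the data only
dft: a DataFrame with the species information (target_name) added

These two DataFrames are prepared.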

from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()

dft = df.copy()
dft['target'] = iris.target
dft['target_name'] = dft.target.replace({i: name for i, name in enumerate(iris.target_names)})
dft.head()

pandas¶

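pandas calls matplotlib for plotting, so the arguments for fine-grained settings are the same as in matplotlib.

There are two histogram methods for a DataFrame, DataFrame.hist() and DataFrame.plot.hist(), and their output differs slightly.

df.hist() draws one graph per column
- The x-axis, y-axis, and bins are not aligned by default, so pass arguments to align them
  - To share the x-axis and y-axis, pass sharex=True and sharey=True respectively
  - To align the bins, specify range, e.g. range=(0, df.max().max())
df.plot.hist() draws all columns in a single graph
- The histograms overlap by default, so specify alpha to change the transparency
- Set stacked=True to stack them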
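
The effect of a shared range on the bins can be checked without plotting. The following is a minimal sketch, assuming a hypothetical toy DataFrame in place of the iris data, and using np.histogram (the function matplotlib's hist relies on for binning) to compare the bin edges with and without a common range.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the iris DataFrame
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [0.5, 6.0, 7.5, 8.0]})

upper = toy.max().max()  # largest value across all columns

# Without a common range, each column gets bin edges from its own min and max
edges_free = {c: np.histogram(toy[c], bins=10)[1] for c in toy.columns}

# With a common range, every column shares exactly the same bin edges
edges_shared = {
    c: np.histogram(toy[c], bins=10, range=(0, upper))[1] for c in toy.columns
}

print(edges_free["a"])
print(edges_shared["a"])
```

With the shared range, the edges for every column are identical, which is what makes the per-column panels directly comparable.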

# pandas 1.0 supports selecting the plotting backend, so check that it is matplotlib
pd.options.plotting.backend

'matplotlib'

df.hist()

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x11f4e5890>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12154d110>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x121584790>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1215bbe10>]],
      dtype=object)

# Unify the range
df.hist(sharex=True, sharey=True, range=(0, df.max().max()))

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x1216a3110>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1217dcd50>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1217abd90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12184cf10>]],
      dtype=object)

df.plot.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x121775590>

df.plot.hist(alpha=0.3)

<matplotlib.axes._subplots.AxesSubplot at 0x121ad1f90>

df.plot.hist(stacked=True)

<matplotlib.axes._subplots.AxesSubplot at 0x121c57b10>

Series, Groupby histogram¶

hist() is also implemented for a Series and for a grouped (groupby) DataFrame.

df['petal length (cm)'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x121dc7150>

dft[['petal length (cm)', 'target_name']].groupby('target_name').hist()

target_name
setosa        [[AxesSubplot(0.125,0.125;0.775x0.755)]]
versicolor    [[AxesSubplot(0.125,0.125;0.775x0.755)]]
virginica     [[AxesSubplot(0.125,0.125;0.775x0.755)]]
dtype: object

Histograms per label¶

Histograms per label (target_name) can be drawn neatly with groupby() or pivot().

dft.groupby('target_name')['petal length (cm)'].hist(range=(0,df.max().max()), alpha=0.3)

target_name
setosa        AxesSubplot(0.125,0.125;0.775x0.755)
versicolor    AxesSubplot(0.125,0.125;0.775x0.755)
virginica     AxesSubplot(0.125,0.125;0.775x0.755)
Name: petal length (cm), dtype: object

dft.pivot(columns='target_name')['petal length (cm)'].hist(range=(0,df.max().max()))

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x1222d7550>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x122335ad0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1223ccb90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x122418610>]],
      dtype=object)
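
To see why the pivot() call above yields one panel per label, it may help to look at the intermediate table. This is a minimal sketch on a hypothetical four-row DataFrame (not the iris data): pivot() spreads the values into one column per label and fills the rest with NaN, and hist() then draws each column separately.

```python
import pandas as pd

# Hypothetical miniature version of dft
toy = pd.DataFrame({
    "petal length (cm)": [1.4, 1.3, 4.7, 5.1],
    "target_name": ["setosa", "setosa", "versicolor", "virginica"],
})

# One column per label; rows that belong to another label become NaN
wide = toy.pivot(columns="target_name")["petal length (cm)"]

print(wide)
print(wide.notna().sum())
```

Because NaN values are dropped when each column is binned, every panel only counts the rows of its own label.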

matplotlib¶

Under the hood, pandas calls plt.hist().

plt.hist()
- The data x must be an (n,) array or a sequence of (n,) arrays, so take care when passing a DataFrame

# Bad: a single-column DataFrame has shape (150, 1)
print(df[['petal length (cm)']].shape)
plt.hist(x=df[['petal length (cm)']])

(150, 1)

(array([[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 1., 0., 0.],
        [0., 0., 0., ..., 1., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 array([1.  , 1.59, 2.18, 2.77, 3.36, 3.95, 4.54, 5.13, 5.72, 6.31, 6.9 ]),
 <a list of 150 Lists of Patches objects>)

# OK: a Series has shape (150,)
print(df.loc[:,'petal length (cm)'].shape)
plt.hist(x=df.loc[:,'petal length (cm)'])

(150,)

(array([37., 13.,  0.,  3.,  8., 26., 29., 18., 11.,  5.]),
 array([1.  , 1.59, 2.18, 2.77, 3.36, 3.95, 4.54, 5.13, 5.72, 6.31, 6.9 ]),
 <a list of 10 Patch objects>)
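
The shape difference between the two calls above can be reproduced without plotting. This is a minimal sketch with a hypothetical four-row DataFrame, showing a few ways to obtain the (n,) shape that plt.hist() expects.

```python
import pandas as pd

# Hypothetical miniature version of df
toy = pd.DataFrame({"petal length (cm)": [1.4, 1.3, 4.7, 5.1]})

two_d = toy[["petal length (cm)"]]     # list of columns -> DataFrame, shape (4, 1)
one_d = toy["petal length (cm)"]       # single label -> Series, shape (4,)
squeezed = two_d.squeeze("columns")    # single-column DataFrame -> Series
flat = two_d.to_numpy().ravel()        # plain one-dimensional NumPy array

print(two_d.shape, one_d.shape, squeezed.shape, flat.shape)
```

Any of the last three objects can be passed as x; only the double-bracket selection keeps the second dimension.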

Showing a legend¶

Pass the label names (column names) to label and call plt.legend() to draw a legend.

The legend position is adjustable (reference: https://qiita.com/matsui-k20xx/items/291400ed56a39ed63462)

cols = ['petal length (cm)','petal width (cm)']
print(df.loc[:,cols].T.shape)
plt.hist(x=df.loc[:,cols].T, label=cols)
plt.legend()

(2, 150)

<matplotlib.legend.Legend at 0x12390cdd0>
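
As an illustration of adjusting the legend position, the following minimal sketch uses randomly generated data and hypothetical label names rather than the iris columns. It places the legend just outside the top-right corner of the axes with loc and bbox_to_anchor; the Agg backend is selected only so that the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = [rng.normal(0, 1, 100), rng.normal(2, 1, 100)]  # hypothetical sample data
names = ["first", "second"]                            # hypothetical labels

fig, ax = plt.subplots()
ax.hist(data, label=names)

# loc chooses which corner of the legend box is anchored;
# bbox_to_anchor sets where that corner goes in axes coordinates
legend = ax.legend(loc="upper left", bbox_to_anchor=(1.0, 1.0))
legend_labels = [t.get_text() for t in legend.get_texts()]
print(legend_labels)
```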

Display settings¶

Specifying histtype switches the drawing style without any fine-grained settings.

There are four types: {'bar', 'barstacked', 'step', 'stepfilled'}.

(Reference: https://matplotlib.org/3.1.1/gallery/statistics/histogram_multihist.html)

To show several graphs at once, use subplots (reference: matplotlib.pyplot.subplots)

fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True, figsize=(12, 8))

for htype,ax in zip(['bar', 'barstacked', 'step', 'stepfilled'], axes.flatten()):
    ax.hist(iris.data, histtype=htype, label=iris.feature_names, bins=10)
    ax.set_title(htype)
    ax.legend(loc='upper right')
    
# fig.show()
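
The bins=10 argument used above asks for ten equal-width bins, but bins can also be given as explicit edges. The following is a minimal sketch of both forms with np.histogram, which follows the same bins convention as plt.hist(); the values are hypothetical measurements, not iris data.

```python
import numpy as np

values = np.array([1.4, 1.3, 4.7, 5.1, 6.0, 3.3])  # hypothetical measurements

# An integer asks for that many equal-width bins between min and max
counts_n, edges_n = np.histogram(values, bins=5)

# A sequence gives the bin edges explicitly: 0, 1, ..., 7
edges_fixed = np.arange(0, 8, 1.0)
counts_f, _ = np.histogram(values, bins=edges_fixed)

print(edges_n)
print(counts_f)
```

Explicit edges are handy when several histograms need to be compared, because the bins no longer depend on the minimum and maximum of each dataset.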

save graph¶

Save with savefig()

fig.savefig('matplotlib_sample.png')

Histograms per label and for multiple datasets (matplotlib)¶

When x is a sequence of (n,) arrays, the label names can be passed to label.

To combine data stored in different places into one histogram, overlay several plt.hist() calls.

fig, ax = plt.subplots(1,1)

for n in iris.target_names:
    # With a DataFrame there is no real need to pass the data piece by piece; the rows for each target_name are extracted here only as a concrete example
    x = dft.query('target_name=="{}"'.format(n))['petal length (cm)']
    ax.hist(x=x, label=n)
    
ax.legend(loc='upper right')
# fig.show()

<matplotlib.legend.Legend at 0x123e132d0>

fig, ax = plt.subplots(1,1)

x_data = []
for n in iris.target_names:
    x_data.append(dft.query('target_name=="{}"'.format(n))['petal length (cm)'])

# The data can also be passed all at once as a list
ax.hist(x=x_data, label=iris.target_names, histtype='barstacked')
ax.legend(loc='upper right')
# fig.show()

<matplotlib.legend.Legend at 0x123ec4290>

seaborn¶

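There is no sns.hist(). The function suited to histograms is sns.distplot.

kde (kernel density estimate) is True by default
- Set kde=False when it is not needed
The accepted data a is a Series, 1d-array, or list, so passing a Series as-is behaves as expected

It can be used in mostly the same way as matplotlib, but the usability differs in small ways, so calling sns.set() and then using matplotlib directly is honestly easier in some cases.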

sns.set()

sns.distplot(df[['petal length (cm)']], bins=10)

<matplotlib.axes._subplots.AxesSubplot at 0x123e5a9d0>

# To change matplotlib-side settings such as range, pass them through hist_kws
sns.distplot(df[['petal length (cm)']], bins=10, kde=False, hist_kws={'range': (0, 10)})

<matplotlib.axes._subplots.AxesSubplot at 0x1240ebcd0>

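Histograms per label and for multiple datasets (seaborn)¶

sns.distplot() accepts only a one-dimensional array.

Therefore, to combine multiple datasets into one histogram, sns.distplot() calls have to be overlaid.

Bars cannot be stacked or placed side by side through histtype here, so for that layout use the matplotlib approach for per-label and multiple-dataset histograms.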

for i in range(4):
    sns.distplot(iris.data[:,i],
                 label=iris.feature_names[i],
                 kde=False, 
                 bins=10,
                 hist_kws={'range': [1,10]}
                )
    
plt.legend()

<matplotlib.legend.Legend at 0x122ca9110>

save¶

Same as matplotlib.

plt.savefig('seaborn_sample.png')

<Figure size 432x288 with 0 Axes>
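
The `<Figure size 432x288 with 0 Axes>` output above suggests that plt.savefig() ran in a separate notebook cell, after the inline backend had already closed the plotted figure, so an empty figure was probably saved. One way around this is to keep the Axes and save through its parent Figure. The sketch below assumes matplotlib only (sns.distplot returns a matplotlib Axes, as the outputs above show), randomly generated data, and a hypothetical file name.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(5.0, 1.0, 150)  # hypothetical data standing in for one iris column

fig, ax = plt.subplots()
ax.hist(values, bins=10)

# Save through the Figure that owns the Axes instead of the "current" pyplot figure
out_path = os.path.join(tempfile.mkdtemp(), "hist_sample.png")  # hypothetical file name
ax.get_figure().savefig(out_path)

n_axes = len(fig.axes)
saved_bytes = os.path.getsize(out_path)
print(n_axes, saved_bytes)
```

Calling plt.savefig() in the same cell as the plotting code should have the same effect in a notebook.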

plotly¶

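plotly is JavaScript-based, and it is nice that tooltips are set up automatically. The default look is good enough to use as-is.

It is also seen often in Kaggle kernels.

Plain plotly requires a lot of code, so plotly-express is preferable where possible.

For fine adjustments, however, you end up changing the settings of plotly itself.

For histograms, most questions are answered on the https://plotly.com/python/histograms/ page.

(Reference: https://qiita.com/inoory/items/12028af62018bf367722)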

Histograms with plotly only¶

To draw a histogram with plotly alone, without plotly-express, use go.Figure() and go.Histogram().

Passing a list of go.Histogram() objects to go.Figure() creates a histogram of multiple datasets.

import plotly
import plotly.graph_objs as go

# Use plotly offline
plotly.offline.init_notebook_mode(connected=False)
# From here on, when working offline, replace fig.show() with plotly.offline.iplot(fig)

fig = go.Figure(data=go.Histogram(x=iris.data[:,0]))
fig.show()

# Show the histograms for each target_name together
data=[
    go.Histogram(x=dft.query('target_name=="{}"'.format(n))['petal length (cm)'], name=n)
      for n
    in iris.target_names
]

fig = go.Figure(data=data)
fig.show()

The range is set through xbins.
Also, instead of passing data as a list, traces can be added one at a time with add_trace.

fig = go.Figure()

for i in range(len(iris.feature_names)):
    fig.add_trace(go.Histogram(x=iris.data[:,i], name=iris.feature_names[i],
                 xbins={'start':0, 'end':10, 'size':1}))

fig.show()
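
To preview which bins a given xbins dict implies, the edges can be approximated with NumPy. This is only a rough sketch: it assumes the bins step from start in increments of size until end is reached, uses hypothetical values, and may differ from plotly in how values exactly on an edge are counted.

```python
import numpy as np

xbins = {"start": 0, "end": 10, "size": 1}       # same dict as in the plotly call above
values = np.array([0.2, 1.4, 1.5, 3.0, 9.9])     # hypothetical measurements

# Edges step from start in increments of size until end is reached
edges = np.arange(xbins["start"], xbins["end"] + xbins["size"], xbins["size"])
counts, _ = np.histogram(values, bins=edges)

print(edges)
print(counts)
```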

save¶

The figure can also be exported as HTML: https://plotly.com/python/interactive-html-export/
To save offline, pass filename as an argument
- Reference: http://python.zombie-hunting-club.com/entry/2017/11/03/223753#3-オフラインで使ってみる

plotly.offline.plot(fig, filename='plotly_sample.html')

'plotly_sample.html'

barmode¶

barmode changes the layout
- The modes are "stack" | "group" | "overlay" | "relative"
- Pass go.Layout(barmode='overlay') to go.Figure(), or update it afterwards with fig.update_layout()
Transparency is set with opacity
- Specify it on each go.Histogram(), or update all traces at once with fig.update_traces()

data=[
    go.Histogram(x=iris.data[:,i], name=iris.feature_names[i],
                 xbins={'start':0, 'end':10, 'size':0.5})
      for i 
    in range(len(iris.feature_names))
]

layout = go.Layout(barmode='overlay')
fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.3)
fig.show()

data=[
    go.Histogram(x=iris.data[:,i], name=iris.feature_names[i],
                 xbins={'start':0, 'end':10, 'size':0.5},
                opacity=0.8)
      for i 
    in range(len(iris.feature_names))
]

layout = go.Layout(barmode='stack')
fig = go.Figure(data=data, layout=layout)
fig.show()

plotly.subplots¶

Multiple graphs can be shown in the same way as with matplotlib (reference: https://plotly.com/python/subplots/)

from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2)


fig.add_trace(
    go.Histogram(x=iris.data[:,0], name=iris.feature_names[0],
                 xbins={'start':0, 'end':10, 'size':0.5}),
    row=1, col=1
)

fig.add_trace(
    go.Histogram(x=iris.data[:,1], name=iris.feature_names[1],
                 xbins={'start':0, 'end':10, 'size':0.5}),
    row=1, col=2
)

fig.update_layout(title_text="plotly.subplot")
fig.show()

plotly-express¶

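plotly-express is a convenient wrapper for calling plotly.

A histogram can be drawn just by passing a DataFrame and a column name
Passing a column name to color colors the bars by the values of that column
The arguments for fine-grained settings are the same as in plotly (opacity, etc.)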

import plotly.express as px
fig = px.histogram(df, x='petal length (cm)')
fig.show()

fig = px.histogram(dft, x='petal length (cm)', color='target_name')
fig.show()

fig = px.histogram(dft, x='petal length (cm)', color='target_name', opacity=0.3)
fig.update_layout(barmode='overlay')
fig.show()

plotly-express and subplots¶

Fine-grained settings such as color are not supported when combining plotly-express with subplots

(Reference: https://github.com/plotly/plotly_express/issues/83)

fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    px.histogram(dft, x='petal length (cm)')['data'][0],
    row=1, col=1
)

fig.add_trace(
    px.histogram(dft, x='petal width (cm)')['data'][0],
    row=1, col=2
)

fig.update_layout(title_text="length and width")
fig.show()

bokeh¶

bokeh itself has no histogram function, so the histogram values computed with numpy have to be passed to a bar chart (vbar)

Reference

import bokeh

from bokeh.io import output_notebook, show
from bokeh.plotting import figure, output_file, show
output_notebook()

# Create the data
hist, edges = np.histogram(iris.data[:,0], range=(0,10), bins=10)
edges, hist

(array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.]),
 array([ 0,  0,  0,  0, 22, 61, 54, 13,  0,  0]))

len(hist), len(edges)

(10, 11)

from bokeh.plotting import ColumnDataSource

source = ColumnDataSource(data=dict(
    x=edges[:-1],
    y=hist,
    name=edges[:-1],
))

p = figure(plot_height=350,
           title=iris.feature_names[0],
           toolbar_location=None,
           tools="hover",
           tooltips="x: @x y: @y"
          )

p.vbar(top='y',x='x', width=0.9, source=source)

p.y_range.start = 0
show(p)
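
Note that vbar interprets x as the center of each bar, so passing edges[:-1] as above draws every bar centered on the left edge of its bin. A minimal sketch of computing the bin centers and widths from the np.histogram output, using hypothetical values rather than the iris data:

```python
import numpy as np

values = np.array([4.3, 5.1, 5.8, 6.4, 7.9])  # hypothetical measurements
hist, edges = np.histogram(values, range=(0, 10), bins=10)

centers = (edges[:-1] + edges[1:]) / 2  # midpoint of every bin
widths = np.diff(edges)                 # width of every bin

# e.g. p.vbar(x=centers, top=hist, width=0.9) would then center each bar on its bin
print(centers)
print(hist)
```

An alternative that avoids centers altogether is bokeh's quad glyph, which takes the left and right edges of each bar directly.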

save¶

https://docs.bokeh.org/en/latest/docs/user_guide/export.html
Details are omitted here.

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target_name
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
