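Background
Every time I get a new dataset, I end up doing the same repetitive exploration with pandas.
Today I found a great package: pandas-profiling.
Its author felt that describe was far too bare-bones, so this package runs the following preliminary data analysis for you in one step.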
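Main text
Installation (pick one)
pip install pandas-profiling
conda install pandas-profiling
Requirements
The current version works online only: it needs a network connection to download Bootstrap and jQuery.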
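Prepare the data
import pandas as pd
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2, so this needs an older release

boston = load_boston()  # load once and reuse the result
df = pd.DataFrame(data=boston["data"], columns=boston["feature_names"])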
Run the analysis
import pandas_profiling

profile = pandas_profiling.ProfileReport(df)
profile.to_file(outputfile="output.html")  # HTML output is supported
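To get a feel for what the report saves you from, here is a minimal sketch of the kind of per-column overview I would otherwise assemble by hand in plain pandas. The toy data and the summarize helper are made up for illustration; they are not part of pandas-profiling, and this is not how the package computes its report.

```python
import pandas as pd

# Hypothetical toy data standing in for the DataFrame prepared above.
df = pd.DataFrame({
    "rooms": [6.5, 6.4, 7.2, 7.0, None, 6.4],
    "age": [65.2, 78.9, 61.1, 45.8, 54.2, 58.7],
})


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Build a per-column overview: counts, missing and distinct values, basic stats."""
    rows = {}
    for name, col in frame.items():
        rows[name] = {
            "count": int(col.count()),        # non-missing values
            "missing": int(col.isna().sum()),
            "distinct": int(col.nunique()),   # NaN is not counted
            "mean": col.mean(),
            "min": col.min(),
            "max": col.max(),
        }
    # One row per column of the original frame
    return pd.DataFrame(rows).T


summary = summarize(df)
print(summary)

# Histogram counts for one column, using 10 equal-width bins
hist = pd.cut(df["age"], bins=10).value_counts(sort=False)
print(hist)
```

Even this small version takes a fair amount of typing per dataset, which is exactly the repetitive work a one-line report replaces.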
ProfileReport Attributes
df : DataFrame
Data to be analyzed
bins : int
Number of bins in histogram.
The default is 10.
check_correlation : boolean
Whether or not to check correlation.
It's True by default.
correlation_threshold : float
Threshold to determine if the variable pair is correlated.
The default is 0.9.
correlation_overrides : list
Variable names not to be rejected because they are correlated.
There is no variable in the list (None) by default.
check_recoded : boolean
Whether or not to check recoded correlation (memory heavy feature).
Since it's an expensive computation it can be activated for small datasets.
check_correlation must be True for this check to run.
It's False by default.
pool_size : int
Number of workers in thread pool.
The default is equal to the number of CPUs.
Methods
get_description
Return the description (a raw statistical summary) of the dataset.
get_rejected_variables
Return the list of rejected variables, or an empty list if there are none.
to_file
Write the report to a file.
to_html
Return the report as an HTML string.
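To make check_correlation, correlation_threshold and correlation_overrides a bit more concrete, here is a rough sketch of the idea in plain pandas. The find_rejected helper and the toy columns are hypothetical, and comparing the absolute Pearson correlation against the threshold is my own assumption; the package's actual logic may differ, so treat this as an illustration of the concept rather than its implementation.

```python
import pandas as pd


def find_rejected(frame, threshold=0.9, overrides=None):
    """Return columns that correlate too strongly with an earlier, still-kept column.

    Hypothetical helper for illustration; not pandas-profiling code.
    """
    overrides = set(overrides or [])
    corr = frame.corr().abs()  # assumption: absolute Pearson correlation
    rejected = []
    for i, col in enumerate(corr.columns):
        if col in overrides:
            continue  # overridden columns are never rejected
        for earlier in corr.columns[:i]:
            if earlier in rejected:
                continue  # only compare against columns that are kept
            if corr.loc[earlier, col] > threshold:
                rejected.append(col)
                break
    return rejected


# Made-up data: x_doubled is perfectly correlated with x, noise is not.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_doubled": [2.0, 4.0, 6.0, 8.0, 10.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
})

print(find_rejected(df))                           # ['x_doubled']
print(find_rejected(df, overrides=["x_doubled"]))  # []
```

With the real package, get_rejected_variables is the method that returns the corresponding list from a report.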
Click into an entry to see the details.
Sharing a good find here. Really convenient, isn't it? Many thanks, and praise, to the author!!
Reference:
Official site