[R and Databases] R + Database = A Perfect Match

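Preface

Analysts who regularly process data in R tend to have a soft spot for the dplyr package. Its data-wrangling tools turn messy raw data into something orderly and clear, ready for deeper analysis. Paired with a database it becomes even more useful, because it lets analysts handle data volumes that Excel struggles with. Below we walk through a practical case to see how R and a database fit together. Later posts will cover more dplyr features and ways of accessing databases with SQL.

dplyr works with several databases, such as SQLite, MySQL and PostgreSQL. Here we focus on SQLite.

About the data

First, let's get familiar with the data. In the United States, drug safety monitoring is a rigorous process: when patients experience an adverse reaction after taking a drug, they can report it to the responsible agency (the FDA). The collected reports are public and can be downloaded and analyzed. In this post we use the patients' demographic information and the indications, that is, the conditions for which the drugs were taken. Because drugs differ between China and the US, we leave out the drug information for now; interested readers can adjust the scope of the analysis themselves. The author deliberately uses a small amount of data so that readers can quickly see how to read data from the web with R, store it in a database, join the datasets and then analyze them.

Setup

    library(dplyr)
    library(ggplot2)
    library(data.table)
    library(magrittr)

Downloading the data

First we write a loop that downloads the quarterly data for the first half of 2015 (if disk space allows, a nested loop could download more than one year).

    year <- 2015
    for (m in 1:2) {
      url1 <- paste0("http://www.nber.org/fda/faers/", year, "/demo", year, "q", m, ".csv.zip")
      download.file(url1, destfile = "data.zip")  # Demography
      unzip("data.zip")
      url2 <- paste0("http://www.nber.org/fda/faers/", year, "/indi", year, "q", m, ".csv.zip")
      download.file(url2, destfile = "data.zip")  # Indication
      unzip("data.zip")
    }

Parsing the downloads into the demography and indication datasets

    # Read every demography file, keeping only the columns we need
    filenames <- list.files(pattern = "demo.*.csv", full.names = TRUE)
    demography <- rbindlist(lapply(filenames, fread,
      select = c("primaryid", "caseid", "age", "event_dt", "sex", "wt", "occr_country")))
    str(demography)
    # Classes 'data.table' and 'data.frame': 606551 obs. of 7 variables:
    #  $ primaryid   : int 35032933 36655882 38671183 38775713 38783443 40954634 41149942 41352566 41943882 42207644 ...
    #  $ caseid      : int 3503293 3665588 3867118 3877571 3878344 4095463 4114994 4135256 4194388 4220764 ...
    #  $ event_dt    : int 20000118 NA 20021015 NA NA 20040204 200011 200303 20040321 200404 ...
    #  $ age         : num 39 35 54 NA 66 ...
    #  $ sex         : chr "F" "F" "F" "M" ...
    #  $ wt          : num 83 NA 70 NA NA NA NA NA 60.8 NA ...
    #  $ occr_country: chr "US" "DE" "US" "GB" ...
    #  - attr(*, ".internal.selfref")=<externalptr>

    # Same for the indication files
    filenames <- list.files(pattern = "indi.*.csv", full.names = TRUE)
    indication <- rbindlist(lapply(filenames, fread,
      select = c("primaryid", "indi_drug_seq", "indi_pt")))
    str(indication)
    # Classes 'data.table' and 'data.frame': 1409632 obs. of 3 variables:
    #  $ primaryid    : int 35032933 35032933 35032933 35032933 35032933 35032933 35032933 36655882 36655882 36655882 ...
    #  $ indi_drug_seq: int 1 2 3 4 5 6 7 1 2 3 ...
    #  $ indi_pt      : chr "Multiple sclerosis" "Multiple sclerosis" "Depression" "Hypercholesterolaemia" ...
    #  - attr(*, ".internal.selfref")=<externalptr>

Creating the database

We do not give a path here, so the database is created in the current working directory.

    my.db <- src_sqlite("adverse.events", create = TRUE)  # create = TRUE creates a new database

Uploading the datasets to the new database

    copy_to(my.db, demography, temporary = FALSE)  # uploading demography data
    # Source: sqlite 3.8.6 [adverse.events]
    # From: demography [606,551 x 7]
    #
    #    primaryid  caseid event_dt    age   sex    wt occr_country
    #        (int)   (int)    (int)  (dbl) (chr) (dbl)        (chr)
    # 1   35032933 3503293 20000118 39.000     F  83.0           US
    # 2   36655882 3665588       NA 35.000     F    NA           DE
    # 3   38671183 3867118 20021015 54.000     F  70.0           US
    # 4   38775713 3877571       NA     NA     M    NA           GB
    # 5   38783443 3878344       NA 66.000     M    NA           IT
    # 6   40954634 4095463 20040204 65.476     F    NA           JP
    # 7   41149942 4114994   200011 17.000     F    NA
    # 8   41352566 4135256   200303 46.000     F    NA           US
    # 9   41943882 4194388 20040321 75.000     F  60.8
    # 10  42207644 4220764   200404 18.000     F    NA           US
    # ..       ...     ...      ...    ...   ...   ...          ...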
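The src_sqlite() helper used in this post comes from older dplyr releases; as far as we know, recent versions steer users towards opening the connection with DBI instead. As a minimal sketch under that assumption, the create-and-upload step could look roughly like this. The data frame toy_demography (three rows copied from the output above) and the temporary file path are illustrative stand-ins, not the full FAERS download.

```r
library(dplyr)
library(DBI)

# Hypothetical toy data standing in for the real demography table
toy_demography <- data.frame(
  primaryid    = c(35032933, 36655882, 38671183),
  age          = c(39, 35, 54),
  sex          = c("F", "F", "F"),
  occr_country = c("US", "DE", "US"),
  stringsAsFactors = FALSE
)

# Illustrative path in a temporary folder (the post writes to the working directory)
db_path <- file.path(tempdir(), "adverse_events_demo.sqlite")

# Create the SQLite file (if needed) and upload the table permanently
con <- dbConnect(RSQLite::SQLite(), db_path)
invisible(copy_to(con, toy_demography, name = "demography",
                  temporary = FALSE, overwrite = TRUE))
dbDisconnect(con)

# Reconnect later and check that the table is still there
con <- dbConnect(RSQLite::SQLite(), db_path)
tables <- dbListTables(con)
first_rows <- tbl(con, "demography") %>% head(3) %>% collect()
print(tables)
print(first_rows)
dbDisconnect(con)
```

Passing temporary = FALSE is what makes the table survive the disconnect, which mirrors the copy_to() calls in the post.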
    copy_to(my.db, indication, temporary = FALSE)  # uploading indication data
    # Source: sqlite 3.8.6 [adverse.events]
    # From: indication [1,409,632 x 3]
    #
    #    primaryid indi_drug_seq                          indi_pt
    #        (int)         (int)                            (chr)
    # 1   35032933             1               Multiple sclerosis
    # 2   35032933             2               Multiple sclerosis
    # 3   35032933             3                       Depression
    # 4   35032933             4            Hypercholesterolaemia
    # 5   35032933             5 Benign neoplasm of thyroid gland
    # 6   35032933             6                       Depression
    # 7   35032933             7                       Depression
    # 8   36655882             1         Schizoaffective disorder
    # 9   36655882             2         Schizoaffective disorder
    # 10  36655882             3         Schizoaffective disorder
    # ..       ...           ...                              ...

Connecting to an existing database and listing its tables

    my.db <- src_sqlite("adverse.events", create = FALSE)  # create = FALSE connects to an existing database
    src_tbls(my.db)
    # [1] "demography"   "indication"   "sqlite_stat1"

Querying the database

dplyr translates its commands into SQL to work on the data inside the database. We first use tbl to create a reference to each table in the database.

    demography <- tbl(my.db, "demography")
    head(demography)
    #   primaryid  caseid event_dt    age sex wt occr_country
    # 1  35032933 3503293 20000118 39.000   F 83           US
    # 2  36655882 3665588       NA 35.000   F NA           DE
    # 3  38671183 3867118 20021015 54.000   F 70           US
    # 4  38775713 3877571       NA     NA   M NA           GB
    # 5  38783443 3878344       NA 66.000   M NA           IT
    # 6  40954634 4095463 20040204 65.476   F NA           JP

    indication <- tbl(my.db, "indication")
    head(indication)
    #   primaryid indi_drug_seq                          indi_pt
    # 1  35032933             1               Multiple sclerosis
    # 2  35032933             2               Multiple sclerosis
    # 3  35032933             3                       Depression
    # 4  35032933             4            Hypercholesterolaemia
    # 5  35032933             5 Benign neoplasm of thyroid gland
    # 6  35032933             6                       Depression

    FR <- filter(demography, occr_country == "FR")  # Filtering demography of patients from France
    FR$query
    # <Query> SELECT "primaryid", "caseid", "event_dt", "age", "sex", "wt", "occr_country"
    # FROM "demography"
    # WHERE "occr_country" = 'FR'
    # <SQLiteConnection>

    explain(FR)
    # <SQL>
    # SELECT "primaryid", "caseid", "event_dt", "age", "sex", "wt", "occr_country"
    # FROM "demography"
    # WHERE "occr_country" = 'FR'
    #
    # <PLAN>
    #   selectid order from                detail
    # 1        0     0    0 SCAN TABLE demography

Filtering for patients from France shows the SQL query that dplyr generates on its own. The dplyr verbs (select, arrange, filter, mutate, summarize, rename) can all be used to manipulate data inside the database, and the pipe operator (%>%) from the magrittr package lets us chain several commands together.

Data analysis + visualization
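As a warm-up for the analyses in this section, here is a small sketch of how a chain of dplyr verbs behaves on a database table: nothing is computed until the result is requested, and the generated SQL can be inspected first. It assumes a recent dplyr with the dbplyr, DBI and RSQLite packages installed, uses an in-memory SQLite database, and works on a made-up six-row table (toy), so the numbers are purely illustrative.

```r
library(dplyr)
library(DBI)

# In-memory database holding a hypothetical miniature demography table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy <- data.frame(
  primaryid    = 1:6,
  occr_country = c("US", "US", "FR", "GB", "", "US"),
  stringsAsFactors = FALSE
)
invisible(copy_to(con, toy, name = "demography", temporary = FALSE))

demo_tbl <- tbl(con, "demography")

# Build the query lazily: no rows are fetched at this point
by_country <- demo_tbl %>%
  filter(occr_country != "") %>%
  group_by(occr_country) %>%
  summarise(total = n()) %>%
  arrange(desc(total))

show_query(by_country)          # prints the SQL that dplyr generated
result <- collect(by_country)   # runs the query and returns a local tibble
print(result)

dbDisconnect(con)
```

In current versions show_query() plays the role that FR$query plays in the post, and collect() is the step that actually pulls rows into R.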
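ggplot

Outsiders often find a data analyst's work impressively mysterious: draw polished, sophisticated charts, then tease trends out of a mass of data. People in the field know better. Much of the job is the "manual labor" of cleaning data, which takes up far more of an analyst's time and energy than the analysis itself. Before running the examples below, for instance, we could easily have spent more time tidying the data; we will cover that in detail in later posts.

    demography %>% group_by(country = occr_country) %>%
      summarize(total = n()) %>% arrange(desc(total)) %>%
      filter(country != "") %>% head(5)
    #   country  total
    # 1      US 434526
    # 2      GB  18680
    # 3      JP  15384
    # 4      CA  11530
    # 5      FR  11274

    demography %>% group_by(country = occr_country) %>%
      summarize(total = n()) %>% arrange(desc(total)) %>%
      filter(country != "") %>% head(5) %>%
      collect() %>%  # bring the small result into R before using factor()
      mutate(country = factor(country, levels = country[order(total)])) %>%
      ggplot(aes(x = country, y = total)) +
      geom_bar(stat = "identity", color = "blue", fill = "yellow") +
      xlab("") + ggtitle("Top Five Countries With Highest Number Of Adverse Events") +
      theme(plot.title = element_text(size = rel(1.6), lineheight = .9, family = "Times",
                                      face = "bold.italic", colour = "darkgreen")) +
      coord_flip() + ylab("Total Number Of Reports") +
      theme(axis.title.x = element_text(size = 15, lineheight = .9, family = "Times",
                                        face = "bold.italic", colour = "blue")) +
      theme(axis.text.y = element_text(size = 12, family = "Times",
                                       face = "bold.italic", colour = "blue"))

Because the US has so many reports, the differences among the other countries are barely visible on the axis. We therefore drop the US in order to compare the other countries with many adverse-event reports.

    demography %>% group_by(country = occr_country) %>%
      summarize(total = n()) %>% arrange(desc(total)) %>%
      filter(country != "" & country != "US") %>% head(10) %>%
      collect() %>%  # bring the small result into R before using factor()
      mutate(country = factor(country, levels = country[order(total)])) %>%
      ggplot(aes(x = country, y = total)) +
      geom_bar(stat = "identity", color = "blue", fill = "orange") +
      xlab("") + ggtitle("Top Ten Non-US Countries") +
      theme(plot.title = element_text(size = rel(1.6), lineheight = .9, family = "Times",
                                      face = "bold.italic", colour = "darkgreen")) +
      coord_flip() + ylab("Total Number Of Reports") +
      theme(axis.title.x = element_text(size = 15, lineheight = .9, family = "Times",
                                        face = "bold.italic", colour = "blue")) +
      theme(axis.text.y = element_text(size = 12, family = "Times",
                                       face = "bold.italic", colour = "blue"))

    indication %>% group_by(indi_pt) %>% summarise(count = n()) %>%
      arrange(desc(count)) %>% head(5)
    #                               indi_pt  count
    # 1 Product used for unknown indication 463524
    # 2                   Diabetes mellitus  53742
    # 3                Rheumatoid arthritis  47780
    # 4                  Multiple sclerosis  30946
    # 5                 Plasma cell myeloma  29256

    indication %>% group_by(indi_pt) %>% summarise(count = n()) %>%
      arrange(desc(count)) %>% head(6) %>%
      collect() %>%   # tail() and factor() need a local data frame
      tail(-1) %>%    # drop the first row (unknown indication)
      mutate(indi_pt = factor(indi_pt, levels = indi_pt[order(desc(count))])) %>%
      ggplot(aes(x = indi_pt, y = count)) +
      geom_bar(stat = "identity", colour = "#000099", fill = "#000099") +
      ggtitle("Top Five Indication Counts") + xlab("") + ylab("") +
      theme(plot.title = element_text(size = rel(1.6), family = "Times",
                                      face = "bold", colour = "black")) +
      theme(axis.text.x = element_text(angle = 90, size = 12, family = "Times",
                                       face = "bold", colour = "blue"))

We dropped the item with the highest count, "Product used for unknown indication". Among the remaining indications, diabetes mellitus accounts for the most reports. For the US this is roughly what one would expect: the well-known obesity problem in the population makes the related drugs widely used.

Age distribution

The basic plotting functions hist, qplot and ggplot can all draw the distribution. Here we remove age records of 0 or below and of 100 or above; such unrealistic outliers may come from data-entry errors, among other causes.

    demography %>% select(age) %>%
      as.data.frame() %>%  # pull the column into R; demography is a database table here
      mutate(age = round(as.numeric(age))) %>%
      filter(!is.na(age) & age < 100 & age > 0) %>%
      ggplot(aes(x = age)) +
      geom_histogram(breaks = seq(0, 100, by = 5), col = "darkgrey", aes(fill = ..count..)) +
      scale_fill_gradient("Count", low = "green", high = "red") +
      labs(title = "Age Histogram") + labs(x = "Age", y = "") +
      theme(plot.title = element_text(size = rel(1.6), family = "Times",
                                      face = "bold", colour = "black")) +
      theme(axis.text.x = element_text(size = 10, family = "Times",
                                       face = "bold", colour = "black")) +
      theme(axis.title.x = element_text(size = 12, family = "Times",
                                        face = "bold", colour = "black"))

Joining the datasets

    joined <- demography %>% inner_join(indication, by = "primaryid", copy = TRUE)
    head(joined, 5)
    #   primaryid  caseid event_dt age sex wt occr_country indi_drug_seq
    # 1  35032933 3503293 20000118  39   F 83           US             1
    # 2  35032933 3503293 20000118  39   F 83           US             2
    # 3  35032933 3503293 20000118  39   F 83           US             3
    # 4  35032933 3503293 20000118  39   F 83           US             4
    # 5  35032933 3503293 20000118  39   F 83           US             5
    #                            indi_pt
    # 1               Multiple sclerosis
    # 2               Multiple sclerosis
    # 3                       Depression
    # 4            Hypercholesterolaemia
    # 5 Benign neoplasm of thyroid gland

There are many ways to join datasets. The author shows the commonly used inner_join here; right_join, left_join, full_join, semi_join and anti_join are also available, and we will look at them in detail in later posts.

Guest author and instructor at 数据人网: Li Yue (李悦). Master's degree from New York University in financial media; data analyst at a sell-side investment research firm in New York; Chartered Financial Analyst (CFA).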
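As a small illustration of the join variants named in the joining section above, the sketch below applies each of them to two made-up local data frames. The tables demo and indi and their values are hypothetical; only the column names follow the post.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables
demo <- data.frame(
  primaryid = c(1, 2, 3),
  sex       = c("F", "M", "F"),
  stringsAsFactors = FALSE
)
indi <- data.frame(
  primaryid = c(1, 1, 3, 4),
  indi_pt   = c("Depression", "Multiple sclerosis",
                "Diabetes mellitus", "Rheumatoid arthritis"),
  stringsAsFactors = FALSE
)

inner <- inner_join(demo, indi, by = "primaryid")  # only ids present in both tables
left  <- left_join(demo, indi, by = "primaryid")   # every demo row, NA where no match
full  <- full_join(demo, indi, by = "primaryid")   # every id from either table
semi  <- semi_join(demo, indi, by = "primaryid")   # demo rows that have a match
anti  <- anti_join(demo, indi, by = "primaryid")   # demo rows without a match

# Row counts make the differences easy to compare
print(c(inner = nrow(inner), left = nrow(left), full = nrow(full),
        semi = nrow(semi), anti = nrow(anti)))
```

semi_join and anti_join only filter the left table and never add columns from the right one, which makes them handy for checking which patients have any indication record at all.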