I’m working currently with a large data set from the Center for Medicare and Medicaid Services (CMS). A few years ago, I analyzed other CMS data that presented a similar pattern:
1) Multiple csv/text files delineated by time.
2) Preponderance of character/string data types, along with a smattering of dates, booleans, and numerics.
3) The various csv/text files may or may not have identical attributes, though there’s much commonality.
4) The data from the union of all columns is to be maintained, with NA’s assigned for missings.
5) The combined data is often quite large by notebook standards — in excess of 5 GB.
6) And since R/Python-Pandas are memory intensive, it’s critical to optimize storage demands.
The challenge I take on below is to efficiently load and store the data from the six large open payments text files into a single R data.table, accommodating an overlapping but not identical attribute structure. In addition, I contrast storing character data attributes as either character or factor data types.
The factor is an R data type used to represent categories of data. The categories (levels) themselves may be any data type but are most often character or text. Factor values on the other hand are denoted by a vector of integers that “point” to the categories. An R example would be variable gender, consisting of levels c(“Female”,”Male”), represented by values c(1,2). One obvious advantage of factors over text data is that the labels are stored just once, potentially saving storage. A second advantage is that factor ordering is flexible based on how they’re constructed.
Customary guidance on the use of factors revolves on smallish data sizes (100,000 records) and counsels to use factors for attributes with cardinality (# of distinct values) in the hundreds or less. With the much larger sizes here, I can let the data do the talking.
The supporting platform is a Wintel 10 notebook with 128 GB RAM, along with software JupyterLab 1.2.4 and R 3.6.2. The R data.table, tidyverse, pryr, plyr, and knitr packages are featured, as well as functions from my personal stash, detailed below.
Set options, load libraries, and import personal functions. Most of these functions are used in this exercise and revolve on gernerating metadata and counts/frequencies of R data.tables.In [1]:
options(warn=-1) options(scipen = 20) options(datatable.print.topn=100) options(datatable.showProgress=FALSE) options(stringsAsFactors=TRUE) usualsuspects <- c( 'tidyverse', 'data.table', 'pryr', 'plyr','readxl', 'bit', 'rvest', 'magrittr','lubridate','rlist', 'tictoc', 'skimr', 'fst','feather', 'knitr', 'kableExtra', 'ggplot2','RColorBrewer' ) suppressMessages(invisible(lapply(usualsuspects, library, character.only = TRUE))) funcsdir <- "/steve/r/functions" funcsfile <- "rfunctions.r" setwd(funcsdir) source(funcsfile) blanks(1) lsf.str() blanks(2)
allfreqs : function (dtn, catlim = 100)
blanks : function (howmany) colsize : function (dt) dtmeta : function (df) freqsdt : function (DTstr, xstr, percent = TRUE) meta : function (df, data = FALSE, dict = TRUE) mksumfreq : function (freqalldt) mksumfreq2 : function (dt) mykab : function (dt) obj_sz : function (obj) prhead : function (df, howmany = 6)
Set the working directory and identify files for loading. These files were previously downloaded from the CMS website.In [2]:
dname <- "c:/bigdata/raw/medicare/cms" setwd(dname) fname <- c("OP_DTL_GNRL_PGYR2013_P01172020.csv","OP_DTL_GNRL_PGYR2014_P01172020.csv", "OP_DTL_GNRL_PGYR2015_P01172020.csv","OP_DTL_GNRL_PGYR2016_P01172020.csv", "OP_DTL_GNRL_PGYR2017_P01172020.csv","OP_DTL_GNRL_PGYR2018_P01172020.csv") blanks(2)
Take a peek at 100 records from each file to determine variable names. The first three files have an identical 65 attributes; the last three have an identical 75 attributes. The union of columns from all files numbers 91, which, after adding one for the file identifier, represents the number of attributes in the aggregate data.table comprised below.In [3]:
frdfnct <- function(f) tolower(names(fread(f,nrows=100))) nmes <- lapply(fname,frdfnct) lapply(nmes,length) blanks(1) setequal(nmes[[1]],nmes[[2]]) setequal(nmes[[1]],nmes[[3]]) setequal(nmes[[1]],nmes[[4]]) setequal(nmes[[4]],nmes[[5]]) setequal(nmes[[4]],nmes[[6]]) blanks(1) length(nmes[[1]]) + length(setdiff(nmes[[4]],nmes[[1]])) length(setdiff(nmes[[4]],nmes[[1]])) + length(setdiff(nmes[[1]],nmes[[4]])) + length(intersect(nmes[[4]],nmes[[1]])) blanks(2)
- 65
- 65
- 65
- 75
- 75
- 75
TRUETRUEFALSETRUETRUE9191
Load the data into a data.table with character attributes stored as factors. Over 60M records with 92 attributes totaling 22 GB RAM. 81 of the attributes are factors. Much of the load cycle is devoted to updating blanks to NA’s. A column present in files 1-3 but not in 4-6 generates NA’s for the missing values. Much of what follows details metadata about the loaded data.table.In [4]:
tic() cmsgf <- rbindlist(lapply(fname, fread, stringsAsFactors=TRUE,na.strings=""),use.names=TRUE, fill=TRUE, idcol="file") %>% setnames(.,tolower(names(.))) cmsgf[,c("date_of_payment","payment_publication_date"):= .(mdy(as.character(date_of_payment)),mdy(as.character(payment_publication_date)))] dtmeta(cmsgf) toc() blanks(2)
|name |class |rows |columns|size | |:----|:----------------------------|:-------|:------|:-------| |cmsgf|c("data.table", "data.frame")|60658797|92 |22.04 GB| |type |N | |:--------|:--| |integer |7 | |factor |81 | |integer64|1 | |numeric |1 | |Date |2 | 421.12 sec elapsed
Tabulate frequencies for each of the 92 attributes, storing the results in a single data.table, limiting the number of categories displayed to 5000/variable.In [5]:
tic() fcmsgf <- allfreqs("cmsgf",catlim=5000) meta(fcmsgf,data=TRUE) blanks(1) toc() blanks(2)
|name |class |rows |columns|size | |:-----|:----------------------------|:-----|:------|:--------| |fcmsgf|c("data.table", "data.frame")|146059|4 |358.45 MB| |vname|value|frequency|percent | |:----|:----|:--------|:--------| |file |4 |11564445 |19.064745| |file |3 |11554363 |19.048124| |file |2 |11316275 |18.655621| |file |5 |11235934 |18.523173| |file |6 |10815629 |17.830273| |file |1 |4172151 |6.878064 | |vname |value |frequency|percent | |:----------------------------------|:-----------|:--------|:-------| |associated_drug_or_biological_ndc_5|0234-0575-17|1 |0.000002| |associated_drug_or_biological_ndc_5|0234-0575-16|1 |0.000002| |associated_drug_or_biological_ndc_5|50419-171-03|1 |0.000002| |associated_drug_or_biological_ndc_5|0008-0100-01|1 |0.000002| |associated_drug_or_biological_ndc_5|58809-888-01|1 |0.000002| |associated_drug_or_biological_ndc_5|0310-6225-60|1 |0.000002| Classes 'data.table' and 'data.frame': 146059 obs. of 4 variables: $ vname : chr "file" "file" "file" "file" ... $ value : Factor w/ 4127783 levels "1","2","3","4",..: 4 3 2 5 6 1 7 9 10 8 ... $ frequency: int 11564445 11554363 11316275 11235934 10815629 4172151 60614943 38051 5418 385 ... $ percent : num 19.1 19 18.7 18.5 17.8 ... - attr(*, ".internal.selfref")=<externalptr> NULL 72.34 sec elapsed
Summarize the frequencies — one record per attribute.In [6]:
prhead(mksumfreq(fcmsgf)) blanks(2)
vname N sumfreq sumpct I 1: file 6 60658797 100.00000 1 2: change_type 4 60658797 100.00000 2 3: covered_recipient_type 2 60658797 100.00000 3 4: teaching_hospital_ccn 1289 60658797 100.00000 4 5: teaching_hospital_id 5000 60654422 99.99279 5 6: teaching_hospital_name 3817 60658797 100.00000 6 vname N sumfreq 1: associated_drug_or_biological_ndc_4 481 60658797 2: covered_or_noncovered_indicator_5 3 60658797 3: indicate_drug_or_biological_or_device_or_medical_supply_5 5 60658797 4: product_category_or_therapeutic_area_5 476 60658797 5: name_of_drug_or_biological_or_device_or_medical_supply_5 2126 60658797 6: associated_drug_or_biological_ndc_5 278 60658797 sumpct I 1: 100 87 2: 100 88 3: 100 89 4: 100 90 5: 100 91 6: 100 92
List the first 6 attributes present in files 1-3 but not in 4-6. Then identify the first 6 attributes present in files 4-6 but not in 1-3.In [7]:
head(setdiff(nmes[[1]],nmes[[4]])) blanks(1) head(setdiff(nmes[[4]],nmes[[1]])) blanks(2)
- ‘product_indicator’
- ‘name_of_associated_covered_drug_or_biological1’
- ‘name_of_associated_covered_drug_or_biological2’
- ‘name_of_associated_covered_drug_or_biological3’
- ‘name_of_associated_covered_drug_or_biological4’
- ‘name_of_associated_covered_drug_or_biological5’
- ‘related_product_indicator’
- ‘covered_or_noncovered_indicator_1’
- ‘indicate_drug_or_biological_or_device_or_medical_supply_1’
- ‘product_category_or_therapeutic_area_1’
- ‘name_of_drug_or_biological_or_device_or_medical_supply_1’
- ‘associated_drug_or_biological_ndc_1’
Confirm NA’s by file for non-common columns using the flexible freqsdt function. Check.In [8]:
mykab(freqsdt("cmsgf","file,product_indicator")[order(file)]) blanks(1) mykab(freqsdt("cmsgf","file,related_product_indicator")[order(file)]) blanks(2)
|file|product_indicator|frequency|percent | |:---|:----------------|:--------|:--------| |1 |Covered |3827241 |6.309457 | |1 |None |184494 |0.304150 | |1 |Non-Covered |136125 |0.224411 | |1 |Combination |24291 |0.040045 | |2 |Covered |10380620 |17.113132| |2 |None |528144 |0.870680 | |2 |Non-Covered |357677 |0.589654 | |2 |Combination |49834 |0.082155 | |3 |Covered |10651317 |17.559394| |3 |None |531260 |0.875817 | |3 |Non-Covered |307937 |0.507654 | |3 |Combination |63849 |0.105259 | |4 |NA |11564445 |19.064745| |5 |NA |11235934 |18.523173| |6 |NA |10815629 |17.830273|
|file|related_product_indicator|frequency|percent | |:---|:------------------------|:--------|:--------| |1 |NA |4172151 |6.878064 | |2 |NA |11316275 |18.655621| |3 |NA |11554363 |19.048124| |4 |Yes |10986341 |18.111703| |4 |No |578104 |0.953042 | |5 |Yes |10698759 |17.637605| |5 |No |537175 |0.885568 | |6 |Yes |10283818 |16.953548| |6 |No |531811 |0.876725 |
Now conduct the same loading exercise storing the attributes as character rather than factor. Note that elapsed load time is more than double the previous for factor, and the RAM demands, at 40 GB, are almost double.In [9]:
tic() cmsgs <- rbindlist(lapply(fname, fread, stringsAsFactors=FALSE,na.strings=""),use.names=TRUE, fill=TRUE, idcol="file") %>% setnames(.,tolower(names(.))) cmsgs[,c("date_of_payment","payment_publication_date"):= .(mdy(date_of_payment),mdy(payment_publication_date))] dtmeta(cmsgs) toc() blanks(2)
|name |class |rows |columns|size | |:----|:----------------------------|:-------|:------|:-------| |cmsgs|c("data.table", "data.frame")|60658797|92 |40.31 GB| |type |N | |:--------|:--| |integer |7 | |character|81 | |integer64|1 | |numeric |1 | |Date |2 | 999.27 sec elapsed
Tally frequencies for each of the 92 attributes in the character strings version of the data.table. The elapsed time for this operation is double that of the factor version.In [10]:
tic() fcmsgs <- allfreqs("cmsgs",catlim=5000) meta(fcmsgs,data=TRUE) blanks(1) toc() blanks(2)
|name |class |rows |columns|size | |:-----|:----------------------------|:-----|:------|:-------| |fcmsgs|c("data.table", "data.frame")|146059|4 |10.92 MB| |vname|value|frequency|percent | |:----|:----|:--------|:--------| |file |4 |11564445 |19.064745| |file |3 |11554363 |19.048124| |file |2 |11316275 |18.655621| |file |5 |11235934 |18.523173| |file |6 |10815629 |17.830273| |file |1 |4172151 |6.878064 | |vname |value |frequency|percent | |:----------------------------------|:-----------|:--------|:-------| |associated_drug_or_biological_ndc_5|0234-0575-17|1 |0.000002| |associated_drug_or_biological_ndc_5|0234-0575-16|1 |0.000002| |associated_drug_or_biological_ndc_5|50419-171-03|1 |0.000002| |associated_drug_or_biological_ndc_5|0008-0100-01|1 |0.000002| |associated_drug_or_biological_ndc_5|58809-888-01|1 |0.000002| |associated_drug_or_biological_ndc_5|0310-6225-60|1 |0.000002| Classes 'data.table' and 'data.frame': 146059 obs. of 4 variables: $ vname : chr "file" "file" "file" "file" ... $ value : Factor w/ 103193 levels "1","2","3","4",..: 4 3 2 5 6 1 10 8 9 7 ... $ frequency: int 11564445 11554363 11316275 11235934 10815629 4172151 60614943 38051 5418 385 ... $ percent : num 19.1 19 18.7 18.5 17.8 ... - attr(*, ".internal.selfref")=<externalptr> NULL 160.92 sec elapsed
Summarize the frequencies by attribute name.In [11]:
prhead(mksumfreq(fcmsgs)) blanks(2)
vname N sumfreq sumpct I 1: file 6 60658797 100.00000 1 2: change_type 4 60658797 100.00000 2 3: covered_recipient_type 2 60658797 100.00000 3 4: teaching_hospital_ccn 1289 60658797 100.00000 4 5: teaching_hospital_id 5000 60654422 99.99279 5 6: teaching_hospital_name 3817 60658797 100.00000 6 vname N sumfreq 1: associated_drug_or_biological_ndc_4 481 60658797 2: covered_or_noncovered_indicator_5 3 60658797 3: indicate_drug_or_biological_or_device_or_medical_supply_5 5 60658797 4: product_category_or_therapeutic_area_5 476 60658797 5: name_of_drug_or_biological_or_device_or_medical_supply_5 2126 60658797 6: associated_drug_or_biological_ndc_5 278 60658797 sumpct I 1: 100 87 2: 100 88 3: 100 89 4: 100 90 5: 100 91 6: 100 92
Next, write a function to denote the storage allocations for each attribute of a data.table. This will allow us to compare the size of character attributes against the corresponding factors on an attribute basis. colsize deploys a functional coding metaphor.In [12]:
colsize <- function(dt) { f <- function(nm) data.table(cname=nm,type=class(dt[[nm]]),cats=length(unique(dt[[nm]])),sz=obj_sz(dt[[nm]]),osz=object_size(dt[[nm]])) map_df(names(dt), f)[,c("cumosz","I"):=.(cumsum(osz),.I)] } blanks(2)
First up, the factor data.table version. For each column, there’s the name, data type, # distinct categories, size, size in bytes, and cumulative size in bytes.In [13]:
tic() colszf = colsize(cmsgf) setnames(colszf,paste0(names(colszf),"_f")) prhead(colszf) toc() blanks(2)
cname_f type_f cats_f sz_f osz_f cumosz_f I_f 1: file integer 6 231.39 MB 242635240 242635240 1 2: change_type factor 4 231.4 MB 242635888 485271128 2 3: covered_recipient_type factor 2 231.4 MB 242635816 727906944 3 4: teaching_hospital_ccn integer 1289 231.39 MB 242635240 970542184 4 5: teaching_hospital_id integer 6679 231.39 MB 242635240 1213177424 5 6: teaching_hospital_name factor 3817 231.73 MB 242991336 1456168760 6 cname_f type_f cats_f 1: associated_drug_or_biological_ndc_4 factor 481 2: covered_or_noncovered_indicator_5 factor 3 3: indicate_drug_or_biological_or_device_or_medical_supply_5 factor 5 4: product_category_or_therapeutic_area_5 factor 476 5: name_of_drug_or_biological_or_device_or_medical_supply_5 factor 2126 6: associated_drug_or_biological_ndc_5 factor 278 sz_f osz_f cumosz_f I_f 1: 231.43 MB 242670184 22465400912 87 2: 231.4 MB 242635760 22708036672 88 3: 231.4 MB 242635896 22950672568 89 4: 231.43 MB 242673936 23193346504 90 5: 231.56 MB 242805336 23436151840 91 6: 231.41 MB 242655568 23678807408 92 84.96 sec elapsed
Now the character attribute version. Note the order of magnitude difference in execution times.In [15]:
tic() colszs = colsize(cmsgs) setnames(colszs,paste0(names(colszs),"_s")) prhead(colszs) toc() blanks(2)
cname_s type_s cats_s sz_s osz_s cumosz_s I_s 1: file integer 6 231.39 MB 242635240 242635240 1 2: change_type character 4 462.79 MB 485270656 727905896 2 3: covered_recipient_type character 2 462.79 MB 485270600 1213176496 3 4: teaching_hospital_ccn integer 1289 231.39 MB 242635240 1455811736 4 5: teaching_hospital_id integer 6679 231.39 MB 242635240 1698446976 5 6: teaching_hospital_name character 3817 463.1 MB 485595664 2184042640 6 cname_s type_s cats_s 1: associated_drug_or_biological_ndc_4 character 481 2: covered_or_noncovered_indicator_5 character 3 3: indicate_drug_or_biological_or_device_or_medical_supply_5 character 5 4: product_category_or_therapeutic_area_5 character 476 5: name_of_drug_or_biological_or_device_or_medical_supply_5 character 2126 6: associated_drug_or_biological_ndc_5 character 278 sz_s osz_s cumosz_s I_s 1: 462.82 MB 485301200 40871210024 87 2: 462.79 MB 485270600 41356480624 88 3: 462.79 MB 485270720 41841751344 89 4: 462.82 MB 485304992 42327056336 90 5: 462.94 MB 485423192 42812479528 91 6: 462.81 MB 485288208 43297767736 92 1498.73 sec elapsed
Join the two data.tables and display. What’s striking is the difference in storage between character and factor, with a marked advantage for factor — even among attributes that assume tens or hundreds of thousands of distinct values. The difference in memory between the character data.table version (40 GB) and the factor version (22 GB) is striking.In [16]:
colszall <- inner_join(colszf,colszs,by=c("cname_f"="cname_s")) %>% data.table meta(colszall) blanks(2)
|name |class |rows|columns|size | |:-------|:----------------------------|:---|:------|:-------| |colszall|c("data.table", "data.frame")|92 |13 |23.26 KB| Classes 'data.table' and 'data.frame': 92 obs. of 13 variables: $ cname_f : chr "file" "change_type" "covered_recipient_type" "teaching_hospital_ccn" ... $ type_f : chr "integer" "factor" "factor" "integer" ... $ cats_f : int 6 4 2 1289 6679 3817 1032290 120912 108105 378249 ... $ sz_f : chr "231.39 MB" "231.4 MB" "231.4 MB" "231.39 MB" ... $ osz_f : num 242635240 242635888 242635816 242635240 242635240 ... $ cumosz_f: num 242635240 485271128 727906944 970542184 1213177424 ... $ I_f : int 1 2 3 4 5 6 7 8 9 10 ... $ type_s : chr "integer" "character" "character" "integer" ... $ cats_s : int 6 4 2 1289 6679 3817 1032290 120912 108105 378249 ... $ sz_s : chr "231.39 MB" "462.79 MB" "462.79 MB" "231.39 MB" ... $ osz_s : num 242635240 485270656 485270600 242635240 242635240 ... $ cumosz_s: num 242635240 727905896 1213176496 1455811736 1698446976 ... $ I_s : int 1 2 3 4 5 6 7 8 9 10 ... - attr(*, ".internal.selfref")=<externalptr> NULL
In [17]:
print(mykab(colszall[,.(I_f,cname_f,cats_f,type_f,sz_f,type_s,sz_s)])) blanks(2)
|I_f|cname_f |cats_f |type_f |sz_f |type_s |sz_s | |:--|:----------------------------------------------------------------|:-------|:--------|:--------|:--------|:--------| |1 |file |6 |integer |231.39 MB|integer |231.39 MB| |2 |change_type |4 |factor |231.4 MB |character|462.79 MB| |3 |covered_recipient_type |2 |factor |231.4 MB |character|462.79 MB| |4 |teaching_hospital_ccn |1289 |integer |231.39 MB|integer |231.39 MB| |5 |teaching_hospital_id |6679 |integer |231.39 MB|integer |231.39 MB| |6 |teaching_hospital_name |3817 |factor |231.73 MB|character|463.1 MB | |7 |physician_profile_id |1032290 |integer |231.39 MB|integer |231.39 MB| |8 |physician_first_name |120912 |factor |239.11 MB|character|469.58 MB| |9 |physician_middle_name |108105 |factor |238.32 MB|character|468.89 MB| |10 |physician_last_name |378249 |factor |256.02 MB|character|484.53 MB| |11 |physician_name_suffix |271 |factor |231.41 MB|character|462.8 MB | |12 |recipient_primary_business_street_address_line1 |1536978 |factor |355.83 MB|character|575.5 MB | |13 |recipient_primary_business_street_address_line2 |271707 |factor |252.64 MB|character|481.97 MB| |14 |recipient_city |35417 |factor |233.79 MB|character|464.91 MB| |15 |recipient_state |65 |factor |231.4 MB |character|462.79 MB| |16 |recipient_zip_code |366610 |factor |256.37 MB|character|484.96 MB| |17 |recipient_country |53 |factor |231.4 MB |character|462.79 MB| |18 |recipient_province |169 |factor |231.41 MB|character|462.8 MB | |19 |recipient_postal_code |443 |factor |231.42 MB|character|462.81 MB| |20 |physician_primary_type |7 |factor |231.4 MB |character|462.79 MB| |21 |physician_specialty |563 |factor |231.48 MB|character|462.87 MB| |22 |physician_license_state_code1 |61 |factor |231.4 MB |character|462.79 MB| |23 |physician_license_state_code2 |59 |factor |231.4 MB |character|462.79 MB| |24 |physician_license_state_code3 |57 |factor |231.4 MB |character|462.79 MB| |25 |physician_license_state_code4 |53 |factor |231.4 MB |character|462.79 MB| |26 |physician_license_state_code5 |48 |factor |231.4 MB |character|462.79 MB| |27 |submitting_applicable_manufacturer_or_applicable_gpo_name |2513 |factor |231.6 MB |character|462.98 MB| |28 |applicable_manufacturer_or_applicable_gpo_making_payment_id |2390 |integer64|462.79 MB|integer64|462.79 MB| |29 |applicable_manufacturer_or_applicable_gpo_making_payment_name |2705 |factor |231.62 MB|character|462.99 MB| |30 |applicable_manufacturer_or_applicable_gpo_making_payment_state |51 |factor |231.4 MB |character|462.79 MB| |31 |applicable_manufacturer_or_applicable_gpo_making_payment_country |51 |factor |231.4 MB |character|462.79 MB| |32 |total_amount_of_payment_usdollars |270670 |numeric |462.79 MB|numeric |462.79 MB| |33 |date_of_payment |1995 |Date |462.79 MB|Date |462.79 MB| |34 |number_of_payments_included_in_total_amount |163 |integer |231.39 MB|integer |231.39 MB| |35 |form_of_payment_or_transfer_of_value |7 |factor |231.4 MB |character|462.79 MB| |36 |nature_of_payment_or_transfer_of_value |15 |factor |231.4 MB |character|462.79 MB| |37 |city_of_travel |25492 |factor |233.12 MB|character|464.32 MB| |38 |state_of_travel |62 |factor |231.4 MB |character|462.79 MB| |39 |country_of_travel |158 |factor |231.41 MB|character|462.8 MB | |40 |physician_ownership_indicator |3 |factor |231.4 MB |character|462.79 MB| |41 |third_party_payment_recipient_indicator |3 |factor |231.4 MB |character|462.79 MB| |42 |name_of_third_party_entity_receiving_payment_or_transfer_of_value|54991 |factor |236.04 MB|character|467.01 MB| |43 |charity_indicator |3 |factor |231.4 MB |character|462.79 MB| |44 |third_party_equals_covered_recipient_indicator |3 |factor |231.4 MB |character|462.79 MB| |45 |contextual_information |1310350 |factor |373.01 MB|character|594.4 MB | |46 |delay_in_publication_indicator |1 |factor |231.4 MB |character|462.79 MB| |47 |record_id |60658797|integer |231.39 MB|integer |231.39 MB| |48 |dispute_status_for_publication |2 |factor |231.4 MB |character|462.79 MB| |49 |product_indicator |5 |factor |231.4 MB |character|462.79 MB| |50 |name_of_associated_covered_drug_or_biological1 |4284 |factor |231.73 MB|character|463.09 MB| |51 |name_of_associated_covered_drug_or_biological2 |1514 |factor |231.5 MB |character|462.88 MB| |52 |name_of_associated_covered_drug_or_biological3 |1044 |factor |231.47 MB|character|462.86 MB| |53 |name_of_associated_covered_drug_or_biological4 |633 |factor |231.44 MB|character|462.83 MB| |54 |name_of_associated_covered_drug_or_biological5 |363 |factor |231.42 MB|character|462.81 MB| |55 |ndc_of_associated_covered_drug_or_biological1 |7626 |factor |231.92 MB|character|463.26 MB| |56 |ndc_of_associated_covered_drug_or_biological2 |1237 |factor |231.48 MB|character|462.87 MB| |57 |ndc_of_associated_covered_drug_or_biological3 |902 |factor |231.46 MB|character|462.84 MB| |58 |ndc_of_associated_covered_drug_or_biological4 |574 |factor |231.43 MB|character|462.82 MB| |59 |ndc_of_associated_covered_drug_or_biological5 |322 |factor |231.42 MB|character|462.81 MB| |60 |name_of_associated_covered_device_or_medical_supply1 |11113 |factor |232.31 MB|character|463.62 MB| |61 |name_of_associated_covered_device_or_medical_supply2 |2728 |factor |231.6 MB |character|462.98 MB| |62 |name_of_associated_covered_device_or_medical_supply3 |1692 |factor |231.52 MB|character|462.91 MB| |63 |name_of_associated_covered_device_or_medical_supply4 |996 |factor |231.47 MB|character|462.86 MB| |64 |name_of_associated_covered_device_or_medical_supply5 |545 |factor |231.44 MB|character|462.83 MB| |65 |program_year |6 |integer |231.39 MB|integer |231.39 MB| |66 |payment_publication_date |1 |Date |462.79 MB|Date |462.79 MB| |67 |related_product_indicator |3 |factor |231.4 MB |character|462.79 MB| |68 |covered_or_noncovered_indicator_1 |3 |factor |231.4 MB |character|462.79 MB| |69 |indicate_drug_or_biological_or_device_or_medical_supply_1 |5 |factor |231.4 MB |character|462.79 MB| |70 |product_category_or_therapeutic_area_1 |3758 |factor |231.69 MB|character|463.05 MB| |71 |name_of_drug_or_biological_or_device_or_medical_supply_1 |22888 |factor |233.2 MB |character|464.42 MB| |72 |associated_drug_or_biological_ndc_1 |2107 |factor |231.54 MB|character|462.92 MB| |73 |covered_or_noncovered_indicator_2 |3 |factor |231.4 MB |character|462.79 MB| |74 |indicate_drug_or_biological_or_device_or_medical_supply_2 |5 |factor |231.4 MB |character|462.79 MB| |75 |product_category_or_therapeutic_area_2 |1166 |factor |231.49 MB|character|462.87 MB| |76 |name_of_drug_or_biological_or_device_or_medical_supply_2 |7009 |factor |231.92 MB|character|463.26 MB| |77 |associated_drug_or_biological_ndc_2 |1097 |factor |231.47 MB|character|462.86 MB| |78 |covered_or_noncovered_indicator_3 |3 |factor |231.4 MB |character|462.79 MB| |79 |indicate_drug_or_biological_or_device_or_medical_supply_3 |5 |factor |231.4 MB |character|462.79 MB| |80 |product_category_or_therapeutic_area_3 |872 |factor |231.46 MB|character|462.85 MB| |81 |name_of_drug_or_biological_or_device_or_medical_supply_3 |5201 |factor |231.78 MB|character|463.14 MB| |82 |associated_drug_or_biological_ndc_3 |791 |factor |231.45 MB|character|462.84 MB| |83 |covered_or_noncovered_indicator_4 |3 |factor |231.4 MB |character|462.79 MB| |84 |indicate_drug_or_biological_or_device_or_medical_supply_4 |5 |factor |231.4 MB |character|462.79 MB| |85 |product_category_or_therapeutic_area_4 |631 |factor |231.44 MB|character|462.83 MB| |86 |name_of_drug_or_biological_or_device_or_medical_supply_4 |3295 |factor |231.64 MB|character|463.01 MB| |87 |associated_drug_or_biological_ndc_4 |481 |factor |231.43 MB|character|462.82 MB| |88 |covered_or_noncovered_indicator_5 |3 |factor |231.4 MB |character|462.79 MB| |89 |indicate_drug_or_biological_or_device_or_medical_supply_5 |5 |factor |231.4 MB |character|462.79 MB| |90 |product_category_or_therapeutic_area_5 |476 |factor |231.43 MB|character|462.82 MB| |91 |name_of_drug_or_biological_or_device_or_medical_supply_5 |2126 |factor |231.56 MB|character|462.94 MB| |92 |associated_drug_or_biological_ndc_5 |278 |factor |231.41 MB|character|462.81 MB|
The findings above strongly suggest that for handling large data.tables in R factors should be preferred to character for text attributes. Not only is there a significant storage savings — even with high cardinality columns — but likely a performance benefit as well. Add to that the flexibility of different orderings for factors and the case becomes quite compelling.