Like all functions in the data.table R package, fread is fast. Extremely fast. But there’s more to fread than speed. It has several helpful features and options when importing external data into R. Here are five of the most useful.
Note: If you’d like to follow along, download the New York Times CSV file of daily Covid-19 cases by U.S. county at https://github.com/nytimes/covid-19-data/raw/master/us-counties.csv.
Use fread’s nrows option
Is your file large? Would you like to examine its structure before importing the whole thing, without having to open it in a text editor or Excel? Use fread’s nrows option to import only a portion of a file for exploration.
The code below imports just the first 10 rows of the CSV.
mydt10 <- fread("us-counties.csv", nrows = 10)
If you just want to see column names without any data at all, you can use nrows = 0.
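As a minimal sketch of the header-only trick, the code below writes a tiny stand-in CSV to a temporary file and reads it with nrows = 0. The temporary file and its three rows are assumptions made for illustration, not the real Times file.

```r
library(data.table)

# Hypothetical miniature file standing in for us-counties.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "date,county,state,fips,cases,deaths",
  "2020-01-21,Snohomish,Washington,53061,1,0",
  "2020-01-22,Snohomish,Washington,53061,1,0",
  "2020-01-23,Snohomish,Washington,53061,1,0"
), tmp)

# nrows = 0 returns an empty data.table that still carries the column names
header_only <- fread(tmp, nrows = 0)
col_names <- names(header_only)
n_rows <- nrow(header_only)
print(col_names)
```

Swapping tmp for "us-counties.csv" should give the same kind of result on the full file.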
Use fread’s select option
Once you know the file structure, you can choose which columns to import. fread’s select option lets you pick the columns you want to keep.
select takes a vector of either column names or column-position numbers. If names, they need to be in quotation marks, like most vectors of character strings:
mydt <- fread("us-counties.csv",
  select = c("date", "county", "state", "cases"))
As always, numbers don’t need quotation marks:
mydt <- fread("us-counties.csv", select = c(1,2,3,5))
You can use an R object with a vector of column names inside fread, as you can see in this next group of code. I create a vector my_cols with date, county, state, and cases; then I use that vector inside fread.
my_cols <- c("date", "county", "state", "cases")
mydt <- fread("us-counties.csv", select = my_cols)
The opposite of select is drop. You can choose to import all columns except the ones you specify with drop, such as:
mydt <- fread("us-counties.csv", drop = c("fips", "deaths"))
Like select, drop takes a vector of column names or numerical positions.
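To illustrate how the two options mirror each other, here is a small sketch that keeps four columns by name and, separately, drops the other two by position. The temporary file and its two rows are assumptions standing in for the real CSV.

```r
library(data.table)

# Hypothetical miniature file with the same six columns as us-counties.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "date,county,state,fips,cases,deaths",
  "2020-01-21,Snohomish,Washington,53061,1,0",
  "2020-01-24,Cook,Illinois,17031,1,0"
), tmp)

# Keep four columns by name
kept <- fread(tmp, select = c("date", "county", "state", "cases"))

# Drop the remaining two by position (fips is column 4, deaths is column 6)
dropped <- fread(tmp, drop = c(4, 6))

# Both approaches should leave the same set of columns
same_columns <- identical(names(kept), names(dropped))
print(names(kept))
```

Which one to reach for is mostly a matter of which list is shorter: the columns you want or the columns you don’t.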
Use fread with grep
If you are familiar with Unix, you can execute command-line tools right from inside fread. For example, if I just wanted California data, I could use grep to import only lines that contain the text “California.” Note that this searches each entire row as a text string, not a specific column, so your data has to be in a format where that makes sense.
ca <- fread("grep California us-counties.csv")
Unfortunately, grep doesn’t understand the original file’s column names, so you end up with default names.
head(ca)
           V1          V2         V3   V4 V5 V6
1: 2020-01-25      Orange California 6059  1  0
2: 2020-01-26 Los Angeles California 6037  1  0
3: 2020-01-26      Orange California 6059  1  0
4: 2020-01-27 Los Angeles California 6037  1  0
5: 2020-01-27      Orange California 6059  1  0
6: 2020-01-28 Los Angeles California 6037  1  0
However, fread lets us specify column names with the col.names option. I can set the names based on the names from mydt10 that I created above.
ca <- fread("grep California us-counties.csv", col.names = names(mydt10))
> head(ca)
         date      county      state fips cases deaths
1: 2020-01-25      Orange California 6059     1      0
2: 2020-01-26 Los Angeles California 6037     1      0
3: 2020-01-26      Orange California 6059     1      0
4: 2020-01-27 Los Angeles California 6037     1      0
5: 2020-01-27      Orange California 6059     1      0
6: 2020-01-28 Los Angeles California 6037     1      0
We can also use regular expressions, with grep’s -E option, letting us do more complex searches, such as looking for four states at once.
states4 <- fread(cmd = "grep -E 'Texas|Arizona|Florida|South Carolina' us-counties.csv",
col.names = names(mydt10))
Once again, a reminder: This is looking for each of those state names anywhere in the row, not just in the state column. If you run the code above and check what states are included in the results with unique(states4$state), you’ll see Oklahoma and Missouri in the state column along with Texas, Arizona, Florida, and South Carolina. That’s because both Oklahoma and Missouri have counties named Texas.
So, grep during file import is a way to filter out a lot of data you don’t want from a very large data set, but it doesn’t guarantee you only get what you want. After this kind of import, you should still filter specifically on column data to make sure you didn’t get anything unexpected.
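That follow-up filter might look something like the sketch below. The four rows are made-up stand-ins for what the grep import could return (including the two Texas counties outside Texas), and wanted and states4_clean are names introduced only for this example.

```r
library(data.table)

# Hypothetical rows mimicking a grep-based import, including two stray matches
states4 <- data.table(
  county = c("Harris", "Texas", "Texas", "Maricopa"),
  state  = c("Texas", "Oklahoma", "Missouri", "Arizona")
)

wanted <- c("Texas", "Arizona", "Florida", "South Carolina")

# Keep only rows whose state column is one of the four target states
states4_clean <- states4[state %in% wanted]
print(unique(states4_clean$state))
```

The grep step narrows down what gets read into memory; the column filter afterward is what actually enforces the condition you meant.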
Use fread’s colClasses option
You can set column classes during import for just a few columns, not every one. For example, the date column in this data is coming in as character strings, even though it’s in year-month-day format. We can set the column named date to the data type Date during import using the colClasses option.
mydt <- fread("us-counties.csv", colClasses = c("date" = "Date"))
Now, dates are Dates.
> str(mydt)
Classes ‘data.table’ and 'data.frame':	322651 obs. of  6 variables:
 $ date  : Date, format: "2020-01-21" "2020-01-22" "2020-01-23" ...
 $ county: chr  "Snohomish" "Snohomish" "Snohomish" "Cook" ...
 $ state : chr  "Washington" "Washington" "Washington" "Illinois" ...
 $ fips  : int  53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ...
 $ cases : int  1 1 1 1 1 1 1 1 1 1 ...
 $ deaths: int  ...
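If you’d rather check the result in code than read str() output, a quick sketch along these lines should work. The temporary file is an assumption standing in for the real CSV, and inherits() is used so the check only asks whether the column behaves as a Date.

```r
library(data.table)

# Hypothetical miniature file standing in for us-counties.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "date,county,state,fips,cases,deaths",
  "2020-01-21,Snohomish,Washington,53061,1,0",
  "2020-01-24,Cook,Illinois,17031,1,0"
), tmp)

# Set the class for the date column only; the other columns are detected as usual
dt <- fread(tmp, colClasses = c("date" = "Date"))

date_is_date <- inherits(dt$date, "Date")
county_is_chr <- is.character(dt$county)
print(class(dt$date))
```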
Use fread on zipped files
You can import a zipped file without unzipping it first. fread can import gz and bz2 files directly, such as mydt <- fread("myfile.gz") (this relies on the R.utils package being installed). If you need to import a zip file, you can unzip it with the unzip system command within fread, using the syntax mydt <- fread(cmd = 'unzip -cq myfile.zip').
For more R tips, head to InfoWorld’s Do More With R page.
Copyright © 2020 IDG Communications, Inc.