After using Julia successfully in a few new projects, my latest came to a sticky end because of how it saves large data files.
I have used more languages than I remember for numerical work. About a year ago, I wrote a blog titled Which programming language is best for economic research: Julia, Matlab, Python or R? where I recommended Julia.
I am a fan, just used her successfully in a project I ported from Python/Pandas/Numpy. She was three times as fast as Python, and the code was a third of Python’s size. Fantastic.
So being all fired up, we tried Julia on a new project involving CRSP data. The CRSP date on my computer is about 1 gb compressed csv, and we will run a sequence of different estimation methods on it, each creating files of about 24 million rows and perhaps 7 columns, three integers and rest floating point. It might take a few hours, even days, to create each of these.
The next thing is to process the data, perhaps summary statistics by month, and this is when Julia shines. Doing the aggregate calculation with R data.table took 2.4 seconds while Julia dataframes needed only 1.18 seconds.
The R code
df[,.SD[MARKET_CAP < quantile(MARKET_CAP,0.10), list(length(alpha),mean(alpha), median(alpha),mean(A), median(A), sd(RET),mean(MARKET_CAP))], by = month ]
combine(groupby(df, [:month])) do sdf dfs = sdf[sdf.MARKET_CAP .< quantile(sdf.MARKET_CAP,0.1),:] d=DataFrame(N=length(dfs.alpha),alpha_mean=mean(dfs.alpha, alpha_median=median(dfs.alpha) ,A_mean=mean(dfs.A),A_median=median(dfs.A),sd=std(dfs.RET), MARKET_CAP=mean(dfs.MARKET_CAP)) return d end
24 million rows and 7 columns is a lot of data and results in an uncompressed CSV file of around 2 gb, which must be synced to various computers and backed up.
The file size is important, especially when the files are this large. Supposing they get used for a paper, I will need to be confident I can read the files several years in the future, even decades, into the future.
So for any method for storing my results, I have three criteria, how are they compress, what is the file size and if I am confident I can read the file in the distant future.
One criteria missing is speed. When it takes 24 hours to generate some research output, it makes no difference whether saving the results takes one second or one minute.
RData files look good on all three criteria. I have some going back a decade, and they always open. Same with CSV files. But, csv is not the best file format for numerical data, as I will discuss below.
What about Julia? My data is in data frames, and I used the methods discussed in the excellent dataframe tutorial by Bogumił Kamiński and compared the various alternatives by file size and research suitability.
|CSV from R||none||very high||1.9 gb.|
|CSV from R||gzip||very high||734 mb.|
|CSV from R||xz||very high||220 mb.|
|CSV from Julia||none||very high||2.1 gb.|
|CSV from Julia||xz||very high||233 mb.|
|Julia||serialize()||none||very low||1.3 gb.|
|Julia||serialize()||xz||very low||135 mb.|
The file format with the highest confidence is uncompressed CSV, but that took 1.9 gb.
So makes sense to compress. We have tried all of the common methods, and at least for our data, xz is by far the best. Beating gzip, bz2 and zip by a large margin.
An xz compressed R DataFrame is the smallest, at 132 Mb. Since R write() and read() are built into the language, I am pretty confident these files will be readable for the foreseeable future.
Julia serialize() just writes the internal data frame representation to a disk and is not guaranteed to be readable between Julia versions. Hence very low confidence. It doesn’t offer compression, but by doing it externally with xz, the file size was reduced from 1.3 gb to 135 mb.
Julia CSV is a good choice, but the files are huge at 2.1 gb, and even externally xz compressed are 233 mb. gzip would be much worse.
Julia Arrow and Parquet might be a good choice, with high confidence, but these are huge files. Arrow isn’t compressed, but Parquet is supposed to be an efficient storage format. At least for my data it isn’t. One could, of course, compress them afterwards. The xz compressed arrow file was the smallest.
JDF does compress, but not by a huge amount. Also, unique among all the methods, it writes a folder with one file per column and one with metadata. I could certainly see a use case, but the folders are annoying to me, and given the talk about format changes on the Github page, I am not confident about future readability.
So what to use? The easiest and safest would just be to do it all in R.
If I were to use Julia for this project, I would have to use CSV.write() or arrow with external (command line) xz compression.
It is tempting to use one of JLD/JLD2/JDF with external compression. I will not do that. If there are three competing packages for saving Julia data, I have no confidence the one I pick will be around and supported in a decade.
I think it would be important for Julia adoption if it allowed one to save compressed (maybe both gzip and xz) serialized variables that are guaranteed to be readable for the life of Julia as a language. This should not be provided in a third party package, which comes with additional risk, but should be a part of the core language, just like R save().
So why serialise and not just writing data as csv? Three reasons.
The first is that if the csv writer uses full numerical precision, the files become very large, and if it rounds like R fwrite() then one loses data (which made our life difficult a few weeks ago). You can see the difference in R and Julia csv write here, first line:
R, PERMNO,date,alpha,A,RET,MARKET_CAP,X.m1 10241,19290510,2.98480615495697,6.21303834131563e-07,0.04287,368626,0.038961 Julia, PERMNO,date,alpha,A,RET,MARKET_CAP,X_m1 10241.0,1.929051e7,2.9848061549569698,6.213038341315627e-7,0.04287,368626.0,0.038961
Second, a data frame contains data in Julia’s internal format, which could be an integer, float, string, complex, or anything one can put into a data frame, today and in the future. I would not be confident that making a round-trip to CSV would make that data identical after the round-trip.
And finally, I keep a lot of metadata in dictionaries. So the data I work with is a dictionary with data frames, the variables, and versions used to generate the data, time, computer, and other relevant variables. In R one can just use save(List,file=”x.Rdata”,compress=”xz”). At least for my workflow, that is helpful.
csv just isn’t an efficient format for numerical data. You can see that by the 2.1 gb. file size for Julia’s csv export vs. 1.3 gb. for serialize. Neither is compressed. The R csv becomes 734mb and 220gb when compressed with gzip and xz, respectively. Still almost the double size of their serialised counterparts.
csv is useful for sharing data between applications and archival storage, but it is not an optimal format if one is only using a single application. Arrow and parquet might be better for dataframes, but don’t help with Dictionaries.
So, it would be best if Julia could create xz compressed stable serialised files of its variables.
And finally, if I were to make one recommendation to the excellent Julia development team it would be to ensure that a google query for “how to save data in Julia” only delivered the one recommended way, just like “how to save data in R” does. The confusing message from googling “how to save data in Julia” is why I wrote this blog.
© All rights reserved, Jon Danielsson, 2021