War between Data Formats

On this page, I list research papers, existing benchmarks and videos related to data formats. It may be useful for anyone who wants to see which data format is going to win the war.


Research Papers:

These are some research papers on existing data formats and proposed storage layouts. You can observe the trend: storage first moved from plain-text formats to binary formats, and then, within binary formats, from row-oriented layouts to columnar layouts.

  1. D. J. Abadi, S. R. Madden, and N. Hachem. Column-Stores vs. Row-Stores: How Different Are They Really? In SIGMOD, 2008.
  2. A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich. Trojan Data Layouts: Right Shoes for a Running Elephant. In SoCC, 2011.
  3. Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang, and Z. Xu. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems. In ICDE, 2011.
  4. A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata. Column-Oriented Storage Techniques for MapReduce. In VLDB, 2011.
  5. I. Alagiannis, S. Idreos, and A. Ailamaki. H2O: A Hands-free Adaptive Store. In SIGMOD, 2014.
  6. T. Xu and D. Wang. KCGS-Store: A Columnar Storage Based on Group Sorting of Key Columns. In CLOUD, 2016.
  7. R. F. Munir, O. Romero, A. Abelló, B. Bilalli, M. Thiele, and W. Lehner. ResilientStore: A Heuristic-based Data Format Selector for Intermediate Results. In MEDI, 2016.
  8. T. Ivanov and M. Pergolesi. The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A Study on ORC and Parquet. In Concurrency and Computation: Practice and Experience, 2020.
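
To make the row-versus-columnar distinction in these papers concrete, here is a minimal sketch in Python. It is not taken from any of the papers or from any real file format: the three-column table, the fixed-width encoding and all function names are assumptions made purely for illustration. It packs the same rows once record by record and once column by column, then sums a single column from each buffer.

```python
import struct

# Hypothetical table with columns (id: int32, price: float64, quantity: int32).
ROWS = [(1, 9.5, 3), (2, 20.0, 1), (3, 4.25, 7)]

# "<" means little-endian with no padding, so one record is 4 + 8 + 4 = 16 bytes.
RECORD = struct.Struct("<idi")


def encode_row_major(rows):
    """Row layout: all fields of a record are stored next to each other."""
    return b"".join(RECORD.pack(*row) for row in rows)


def encode_column_major(rows):
    """Columnar layout: all values of a column are stored next to each other."""
    ids, prices, quantities = zip(*rows)
    n = len(rows)
    return (
        struct.pack(f"<{n}i", *ids)
        + struct.pack(f"<{n}d", *prices)
        + struct.pack(f"<{n}i", *quantities)
    )


def sum_price_row_major(buf):
    """Visit every 16-byte record even though only one field is needed."""
    return sum(
        RECORD.unpack_from(buf, offset)[1]
        for offset in range(0, len(buf), RECORD.size)
    )


def sum_price_column_major(buf, n):
    """Skip the id column (4 * n bytes) and decode only the price column."""
    return sum(struct.unpack_from(f"<{n}d", buf, 4 * n))


row_buf = encode_row_major(ROWS)
col_buf = encode_column_major(ROWS)
total_from_rows = sum_price_row_major(row_buf)
total_from_columns = sum_price_column_major(col_buf, len(ROWS))
```

Both buffers hold the same bytes' worth of data and both sums agree, but the columnar reader only decodes the contiguous price values, while the row reader has to step across every record. Real formats such as ORC and Parquet add row groups, encodings, compression and metadata on top of this basic idea.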

Existing Benchmarks:

  1. CERN compares two data formats (Avro and Parquet) with two storage engines (HBase and Kudu). They conclude that Parquet and Kudu are good for analytical workloads. [https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines]
  2. SVDS compares several data formats, including Plain Text, Sequence Files, Avro, Parquet and ORC. Their results show that Avro is good for scan-based workloads, whereas Parquet and ORC are good for OLAP workloads. [http://www.svds.com/dataformats/]
  3. Hortonworks also benchmarks JSON, Avro, ORC and Parquet. Their presentation is available on SlideShare. [http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet]
  4. Huawei, together with some other companies, introduced a new file format, Apache CarbonData. It is a columnar format that is good for OLAP queries, and it also supports insert, delete and update operations as well as indexing. [http://carbondata.incubator.apache.org/]



Videos:

  1. Apache Parquet 2013 [https://www.youtube.com/watch?v=pFS-FScophU&list=PLA70L35Y7kjgvArPec7s6j-lJRJkGM1Yc]
  2. Apache Parquet 2014 [https://www.youtube.com/watch?v=MZNjmfx4LMc&index=4&list=PLA70L35Y7kjgvArPec7s6j-lJRJkGM1Yc]
  3. Hortonworks File Formats Benchmark 2016 [https://www.youtube.com/watch?v=tB28rPTvRiI]
  4. Apache Spark with Parquet 2017 [https://www.youtube.com/watch?v=_0Wpwj_gvzg]
  5. Audio about Apache Parquet and Apache Arrow 2017 [https://softwareengineeringdaily.com/2017/01/13/columnar-data-apache-arrow-and-parquet-with-julien-le-dem-and-jacques-nadeau/]