Overview on Benchmark Test of Big Data System
Authors: Yan Yibo (闫义博); Zhu Wenqiang (朱文强); Yang Tong (杨仝); Li Xiaoming (李晓明)
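Abstract:
Benchmark testing is currently the predominant technique for evaluating computer system performance. It applies rigorous evaluation methods to measure selected quantifiable performance metrics of a system, and it assesses systems by comparing the results obtained on different systems. In the big data era, big data computer systems are more diverse and more complex than traditional computer systems. Benchmark testing techniques must therefore cover a broad range of application domains and provide varied data types and complex data operations. This paper summarizes the test specifications used in benchmark testing, and lists some of the challenges and development trends in building benchmark testing techniques for the big data era.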
Keywords: benchmark; test method; big data; performance
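Foundation: National Key Research and Development Program of China (2016YFB1000304); 973 Program (2014CB340400); National Natural Science Foundation of China (61672061); Open Project Fund of the Key Laboratory of Network Data Science and Technology, Chinese Academy of Sciences (CASNDST201707)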