The purpose of calculation optimization is to save computing resources, speed up data processing, and shorten the data production cycle. The first optimization point is to avoid wasting computing power on abnormal data. Some data is well formed but is abnormal by the definition of the business scenario and can be ignored; for example, once a device has been identified as faulty, the data it generates is excluded from further calculation.
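As a minimal sketch of the first point, the snippet below drops records from devices already flagged as faulty before any aggregation runs. The record shape `(device_id, reading)` and the names `drop_faulty` and `total_by_device` are assumptions made for illustration, not part of any particular engine.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# Hypothetical record shape: (device_id, reading).
Record = Tuple[str, float]


def drop_faulty(records: Iterable[Record], faulty_devices: Set[str]) -> Iterator[Record]:
    """Yield only the records whose device has not been flagged as faulty."""
    for device_id, reading in records:
        if device_id not in faulty_devices:
            yield device_id, reading


def total_by_device(records: Iterable[Record]) -> Dict[str, float]:
    """Sum the readings per device; stands in for the expensive calculation."""
    totals: Dict[str, float] = {}
    for device_id, reading in records:
        totals[device_id] = totals.get(device_id, 0.0) + reading
    return totals


if __name__ == "__main__":
    data = [("d1", 1.0), ("d2", 5.0), ("d1", 2.0), ("d3", 4.0)]
    # Device "d2" is assumed to have been identified as faulty upstream.
    print(total_by_device(drop_faulty(data, {"d2"})))
```

Filtering as early as possible means the abnormal records never reach the costly stages of the pipeline.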
The second optimization point is to identify and deal with data skew. Data skew takes two forms: either the volume of data in one area is much larger than in other areas, or the size of some data items is much larger than the average. Further segmenting the skewed part can speed up the calculation. The third optimization point is to improve the performance of the core UDF, since the performance of the UDF largely determines the processing time.
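One common way to segment a skewed part further is key salting with two-stage aggregation. This is a swapped-in illustration, not necessarily the method used here: a hot key is split across several sub-keys, partial results are computed per sub-key, and the partials are merged afterwards. The function name `salted_count` and the parameters `hot_keys` and `num_salts` are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def salted_count(
    keys: Iterable[str],
    hot_keys: Set[str],
    num_salts: int = 4,
    seed: int = 0,
) -> Tuple[Dict[str, int], Dict[Tuple[str, int], int]]:
    """Count occurrences per key, spreading hot keys over several sub-keys."""
    rng = random.Random(seed)

    # Stage 1: partial counts per (key, salt). In a distributed job each
    # (key, salt) pair could be handled by a different worker.
    partial: Dict[Tuple[str, int], int] = defaultdict(int)
    for key in keys:
        salt = rng.randrange(num_salts) if key in hot_keys else 0
        partial[(key, salt)] += 1

    # Stage 2: merge the partial counts back into one count per original key.
    final: Dict[str, int] = defaultdict(int)
    for (key, _salt), count in partial.items():
        final[key] += count

    return dict(final), dict(partial)


if __name__ == "__main__":
    sample = ["a"] * 100 + ["b"] * 3  # "a" is the skewed key
    totals, partials = salted_count(sample, hot_keys={"a"})
    print(totals)
    print(partials)
```

The final totals match a plain count; only the intermediate work for the hot key is divided into smaller pieces.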
Code review can reveal the nodes whose performance can be improved, and those nodes are then optimized; in addition, rewriting a Python UDF as a Java UDF can also yield some performance gain. The fourth optimization point is engine configuration tuning, such as enabling compression for data transmission, setting a reasonable number of map/reduce tasks, and making appropriate use of the Hash Range Cluster indexing mechanism.