Execution time for a database read in Spark


Question

I am trying to find the execution time of the following code:

public Dataset<Row> loadData(SparkSession spark, String url, String query, String driver) {
    long startTime = System.nanoTime();
    Dataset<Row> readDataFrame = spark.read()
            .format("jdbc")
            .option("url", url)
            .option("dbtable", query)
            .option("driver", driver)
            .load();
    long endTime = System.nanoTime();  // same clock as startTime; mixing in currentTimeMillis() here would give garbage
    System.out.println((endTime - startTime) / 1000000 + " ms");
    return readDataFrame;
}

The code above reported about 20 ms. Then I added an action below it:

    long startTime = System.nanoTime();
    Dataset<Row> readDataFrame = spark.read()
            .format("jdbc")
            .option("url", url)
            .option("dbtable", query)
            .option("driver", driver)
            .load();
    long count = readDataFrame.count();  // action: forces the JDBC read to actually run
    long endTime = System.nanoTime();    // same clock as startTime
    System.out.println((endTime - startTime) / 1000000 + " ms");
    return readDataFrame;

This version reported about 2000 ms, which I believe is correct.

Now, we already have an action later in the code, and we don't want to use persist because it could cause memory issues. Is there a good way to measure the time taken by this readDataFrame?


Answer 1

Score: 0

That is not a reliable way to measure the time taken by DataFrame operations.

Spark builds a DAG for DataFrame operations, and transformations are lazy: they are not executed immediately, but only when an action (e.g. count) is called on the DataFrame. The jdbc load() itself therefore reads no data at all, which is why your first snippet reports only about 20 ms — it is timing plan construction, not the read.

If you already have an action later in the code, it is best not to add an extra .count() now just for timing, since that triggers a second full read.

If you want to find out how long the DataFrame read takes, look at the Spark job logs (or the Spark UI), which report per-job and per-stage durations.
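One detail worth fixing regardless of where the measurement ends up: both snippets in the question start the timer with System.nanoTime() but stop it with System.currentTimeMillis(), so the subtraction mixes two unrelated clocks. A minimal sketch of a consistent timing helper in plain Java (no Spark required; the `Timed` class and its names are made up for illustration):

```java
import java.util.function.Supplier;

// Hypothetical helper: both timestamps come from the same monotonic clock
// (System.nanoTime()), unlike the question's mix of nanoTime and
// currentTimeMillis.
public class Timed {
    public static <T> T time(String label, Supplier<T> action) {
        long start = System.nanoTime();
        T result = action.get();  // run the wrapped work
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + " took " + elapsedMs + " ms");
        return result;
    }

    public static void main(String[] args) {
        // With Spark, you would wrap the first real action, e.g.:
        //   long count = Timed.time("jdbc read + count", readDataFrame::count);
        long n = Timed.time("demo", () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
            return sum;
        });
        System.out.println(n);
    }
}
```

Measured this way, the number still includes the action's own work (e.g. the count aggregation), not just the read, which is why the job logs or Spark UI remain the better place to see the read time in isolation.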


huangapple
  • Published 2020-05-05 14:33:19
  • When reposting, please keep this link: https://java.coder-hub.com/61607089.html