Execution time for a database read in Spark


Question

I am trying to find the execution time of the following code:

public Dataset<Row> loadData(SparkSession spark, String url, String query, String driver) {
    long startTime = System.nanoTime();
    Dataset<Row> readDataFrame = spark.read()
            .format("jdbc")
            .option("url", url)
            .option("dbtable", query)
            .option("driver", driver)
            .load();
    long endTime = System.nanoTime();  // same clock as startTime; mixing in currentTimeMillis() here would give garbage
    System.out.println((endTime - startTime) / 1000000 + " ms");
    return readDataFrame;
}

The code above reported about 20 ms. Then I added an action below it:

    long startTime = System.nanoTime();
    Dataset<Row> readDataFrame = spark.read()
            .format("jdbc")
            .option("url", url)
            .option("dbtable", query)
            .option("driver", driver)
            .load();
    long count = readDataFrame.count();  // action: forces the JDBC read to actually run
    long endTime = System.nanoTime();    // same clock as startTime
    System.out.println((endTime - startTime) / 1000000 + " ms");
    return readDataFrame;

This version reported about 2000 ms, which I believe is correct.

Now, we already have an action later in the code, and we don't want to use persist because it could cause memory issues. Is there a good way to measure the time taken by this readDataFrame?


Answer 1

Score: 0

That is not a reliable way to measure the time taken by DataFrame operations.

Spark builds a DAG for DataFrame operations, and transformations are lazy: they are not executed immediately, but only when an action (e.g. count) is called on the DataFrame. The jdbc load() itself therefore reads no data at all, which is why your first snippet reports only about 20 ms — it is timing plan construction, not the read.

If you already have an action later in the code, it is best not to add an extra .count() now just for timing, since that triggers a second full read.

If you want to find out how long the DataFrame read takes, look at the Spark job logs (or the Spark UI), which report per-job and per-stage durations.
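One detail worth fixing regardless of where the measurement ends up: both snippets in the question start the timer with System.nanoTime() but stop it with System.currentTimeMillis(), so the subtraction mixes two unrelated clocks. A minimal sketch of a consistent timing helper in plain Java (no Spark required; the `Timed` class and its names are made up for illustration):

```java
import java.util.function.Supplier;

// Hypothetical helper: both timestamps come from the same monotonic clock
// (System.nanoTime()), unlike the question's mix of nanoTime and
// currentTimeMillis.
public class Timed {
    public static <T> T time(String label, Supplier<T> action) {
        long start = System.nanoTime();
        T result = action.get();  // run the wrapped work
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + " took " + elapsedMs + " ms");
        return result;
    }

    public static void main(String[] args) {
        // With Spark, you would wrap the first real action, e.g.:
        //   long count = Timed.time("jdbc read + count", readDataFrame::count);
        long n = Timed.time("demo", () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
            return sum;
        });
        System.out.println(n);
    }
}
```

Measured this way, the number still includes the action's own work (e.g. the count aggregation), not just the read, which is why the job logs or Spark UI remain the better place to see the read time in isolation.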


huangapple
  • Published 2020-05-05 14:33:19
  • When reposting, please keep this link: https://java.coder-hub.com/61607089.html