OpenCSV - CSVReader issue with keepCarriageReturn

Question

I am trying to read a comma-separated CSV file that looks like this:

"Row ID","StringCol","idxCol"
"INDEX","object","float64"
"Row3","carriage return
carriage return",0.0
"Row4","new line
new line",1.0
"Row5","carriage return and new line
carriage return and new line",2.0
"Row10","",3.0

  • All strings are quoted with "
  • The separator is a comma
  • The line ending is carriage return + line feed (\r\n)
  • Line breaks (\r or \n) within quotes should be left untouched
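
Since the embedded line breaks are invisible in the listing above, here is a rough sketch that writes the sample file with every line break spelled out as an escape sequence. The class name SampleCsvWriter and the helper buildSample are made up for illustration. Which break sits in which row is inferred from the row labels, and the spaces around the embedded breaks are an assumption based on the printed output. It uses only the standard library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SampleCsvWriter {

    // Builds the sample content with explicit escapes (hypothetical helper).
    // Every record ends with \r\n; the breaks inside the quoted fields are
    // \r (Row3), \n (Row4) and \r\n (Row5), as the row labels suggest.
    static String buildSample() {
        String eol = "\r\n";
        return "\"Row ID\",\"StringCol\",\"idxCol\"" + eol
             + "\"INDEX\",\"object\",\"float64\"" + eol
             + "\"Row3\",\"carriage return \r carriage return\",0.0" + eol
             + "\"Row4\",\"new line \n new line\",1.0" + eol
             + "\"Row5\",\"carriage return and new line \r\n carriage return and new line\",2.0" + eol
             + "\"Row10\",\"\",3.0" + eol;
    }

    public static void main(String[] args) throws IOException {
        // Write raw bytes so no platform-specific line ending translation happens.
        Path target = Paths.get("test.csv");
        Files.write(target, buildSample().getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + target.toAbsolutePath());
    }
}
```

Writing the bytes directly, rather than going through a PrintWriter with println, keeps the \r and \n characters exactly as given.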

The following code fails to read it in correctly:

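import com.opencsv.CSVParser;
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;

// Parser with the default escape, separator and quote characters.
CSVParser parser = new CSVParserBuilder()
        .withEscapeChar(CSVParser.DEFAULT_ESCAPE_CHARACTER)
        .withSeparator(CSVParser.DEFAULT_SEPARATOR)
        .withQuoteChar(CSVParser.DEFAULT_QUOTE_CHARACTER)
        .withStrictQuotes(false)
        .build();

File tempFile = new File("test.csv");

// keepCarriageReturn is enabled so that \r inside quoted fields survives.
try (BufferedReader br = Files.newBufferedReader(tempFile.toPath(), StandardCharsets.UTF_8);
        CSVReader reader = new CSVReaderBuilder(br).withCSVParser(parser)
                .withKeepCarriageReturn(true)
                .build()) {

    // Print each parsed record to inspect the fields.
    for (String[] line : reader) {
        System.out.println(Arrays.toString(line));
    }

} catch (IOException e) {
    e.printStackTrace();
}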

The output would look like this:

[Row ID, StringCol, idxCol"
]
[INDEX, object, float64"
]
[Row3, carriage return 
 carriage return, 0.0
]
[Row4, new line 
 new line, 1.0
]
[Row5, carriage return and new line 
 carriage return and new line, 2.0
]
[Row10, , 3.0
]

As you can see, when a closing quote precedes the carriage return at the end of a line, that quote is kept as part of the last field. The \r is kept as part of the field too, even though it lies outside the quotes. This is odd behavior, because it ignores the quoting of that field.

Basically, I see no way to keep carriage returns within quotes and still read the last field correctly. I would not mind removing the carriage return at the end of the line, but I cannot always expect a quote character before it.
Alternatively, I could strip both with a regex that matches a carriage return at the end of the line, optionally preceded by a quote character, but I might run into trouble if this strange behavior changes in the future.
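
For reference, a minimal sketch of that regex workaround, assuming the parser keeps behaving as in the output above. The class name TrailingCarriageReturnCleaner and the method cleanLastField are hypothetical and not part of OpenCSV. It only touches the last field of a row and leaves carriage returns elsewhere alone.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class TrailingCarriageReturnCleaner {

    // An optional quote followed by a carriage return at the very end of the field.
    private static final Pattern TRAILING_CR = Pattern.compile("\"?\\r\\z");

    // Returns a copy of the row whose last field has the trailing residue removed.
    // Assumption: only the last field of a record carries the leftover quote and \r.
    static String[] cleanLastField(String[] row) {
        if (row == null || row.length == 0) {
            return row;
        }
        String[] cleaned = Arrays.copyOf(row, row.length);
        int last = cleaned.length - 1;
        if (cleaned[last] != null) {
            cleaned[last] = TRAILING_CR.matcher(cleaned[last]).replaceFirst("");
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Rows shaped like the ones in the printed output.
        String[] header = {"Row ID", "StringCol", "idxCol\"\r"};
        String[] row3 = {"Row3", "carriage return \r carriage return", "0.0\r"};
        System.out.println(Arrays.toString(cleanLastField(header)));
        System.out.println(Arrays.toString(cleanLastField(row3)));
    }
}
```

In the reading loop this would be applied to each line before it is used. The sketch cannot tell a leftover quote from a last field that genuinely ends in a quote followed by \r, so it shares the fragility described above.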

huangapple
  • Posted on May 29, 2020, 14:57:32
  • When reposting, please keep the link to this article: https://java.coder-hub.com/62080334.html