Parsing Underscore Separated Key Value List
Question
We have filenames that contain a list of identifiers and their values, all separated by underscores. At first this seemed easy to parse with a regex, but the problem is that both identifiers and values can contain anything. I do have access to the list of identifiers, though.
Here is an example:
Identifiers:
{PG, PGN, T, TN, Axis}
Filenames:
Measurement_2020-08-10 13.08.04.578_Batch counter_41.0_PGN_1338_TN_1337
Measurement_2020-08-10 13.05.15.065_Batch counter_39.0_PG_under_score_program_name_T_1337
Measurement_2020-08-10 13.05.15.065_Batch counter_39.0_Axis_unsolvable_PG_T_bla
Expected results:
{[PGN,1338],[TN,1337]}
{[PG,under_score_program_name],[T,1337]}
Ambiguous, with two possible solutions: {[Axis,unsolvable_PG],[T,bla]} or {[Axis,unsolvable],[PG,T_bla]}
As you can see, I constructed some of these to test certain problematic values, especially the last one, where an identifier is actually used as part of a value.
Obviously there must be a way to solve this, since I can look at it and figure it out quite quickly, but I just can't come up with a way to parse it correctly.
I added the regex tag because it might be possible to solve this with one.
Thank you in advance for any suggestions.
Answer 1
Score: 0
Plain old StringTokenizer:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.StringTokenizer;

    public class FilenameParser {
        private final Set<String> keywords;

        public FilenameParser(Set<String> keywords) {
            this.keywords = keywords;
        }

        public Map<String, String> parse(String filename) {
            Map<String, String> results = new HashMap<>();
            StringTokenizer tokenizer = new StringTokenizer(filename, "_");
            // Tokens before the first keyword are skipped.
            while (tokenizer.hasMoreTokens()) {
                nextToken(tokenizer.nextToken(), tokenizer, results);
            }
            return results;
        }

        // If token is a keyword, collect the following tokens as its value
        // until the next keyword, which is then handled recursively.
        private void nextToken(String token, StringTokenizer tokenizer, Map<String, String> results) {
            if (!keywords.contains(token)) {
                return;
            }
            String key = token;
            List<String> value = new ArrayList<>();
            boolean keywordFound = false;
            while (tokenizer.hasMoreTokens() && !keywordFound) {
                token = tokenizer.nextToken();
                if (keywords.contains(token)) {
                    keywordFound = true;
                    nextToken(token, tokenizer, results);
                } else {
                    value.add(token);
                }
            }
            // Re-join the value parts with the underscores the tokenizer removed.
            results.put(key, String.join("_", value));
        }
    }
Usage:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class Starter {
        public static void main(String[] args) {
            Set<String> keywords = new HashSet<>(Arrays.asList("PG", "PGN", "T", "TN", "Axis"));
            FilenameParser parser = new FilenameParser(keywords);
            Map<String, String> result = parser.parse("Measurement_2020-08-10 13.05.15.065_Batch counter_39.0_Axis_unsolvable_PG_T_bla");
            System.out.println(result);
        }
    }

Result: {T=bla, PG=, Axis=unsolvable}

Note that every token matching an identifier is treated as a key, so for the ambiguous filename PG ends up with an empty value.
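If you would rather detect the ambiguous case than silently pick one reading, one option is to enumerate the candidate readings and compare them. The following is only a rough sketch, not part of the parser above, and it rests on assumptions the question does not state: the first token that matches an identifier always starts the list, a value is never empty, an identifier is used as a key at most once, and readings with more keys are preferred. The class name `AmbiguityCheck` and the method `bestParses` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AmbiguityCheck {

    /** Returns every reading with the maximum number of keys; more than one entry means ambiguous. */
    public static List<Map<String, String>> bestParses(String filename, Set<String> keywords) {
        String[] tokens = filename.split("_");

        // Positions of all tokens that could be keys.
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (keywords.contains(tokens[i])) {
                candidates.add(i);
            }
        }

        List<Map<String, String>> best = new ArrayList<>();
        if (candidates.isEmpty()) {
            return best;
        }

        // Assumption: the first candidate is always a key; each later
        // candidate may be either a key or part of a value. Try every
        // combination (fine for the handful of candidates in a filename).
        int optional = candidates.size() - 1;
        for (int mask = 0; mask < (1 << optional); mask++) {
            List<Integer> keys = new ArrayList<>();
            keys.add(candidates.get(0));
            for (int bit = 0; bit < optional; bit++) {
                if ((mask & (1 << bit)) != 0) {
                    keys.add(candidates.get(bit + 1));
                }
            }
            Map<String, String> parse = toMap(tokens, keys);
            if (parse == null) {
                continue;
            }
            // Keep only the readings with the most keys seen so far.
            if (!best.isEmpty() && parse.size() > best.get(0).size()) {
                best.clear();
            }
            if (best.isEmpty() || parse.size() == best.get(0).size()) {
                best.add(parse);
            }
        }
        return best;
    }

    /** Builds the key/value pairs, or returns null if a value would be empty or a key repeats. */
    private static Map<String, String> toMap(String[] tokens, List<Integer> keys) {
        Map<String, String> parse = new LinkedHashMap<>();
        for (int k = 0; k < keys.size(); k++) {
            String key = tokens[keys.get(k)];
            int start = keys.get(k) + 1;
            int end = (k + 1 < keys.size()) ? keys.get(k + 1) : tokens.length;
            if (start >= end || parse.containsKey(key)) {
                return null;
            }
            parse.put(key, String.join("_", Arrays.copyOfRange(tokens, start, end)));
        }
        return parse;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("PG", "PGN", "T", "TN", "Axis"));
        System.out.println(bestParses(
            "Measurement_2020-08-10 13.05.15.065_Batch counter_39.0_Axis_unsolvable_PG_T_bla", keywords));
    }
}
```

Under those assumptions, the third filename from the question yields two readings, `[{Axis=unsolvable, PG=T_bla}, {Axis=unsolvable_PG, T=bla}]`, while the first two filenames each yield exactly one. Without the "prefer more keys" rule even the first filename would have a second reading (`PGN=1338_TN_1337`), so some tie-breaking rule of this kind seems unavoidable.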