Parsing Underscore Separated Key Value List

Question

We have filenames that contain a list of identifiers and their values, all separated by underscores. At first this seemed easy to parse with a regex, but the problem is that both identifiers and values can contain anything, including underscores. I do have access to a list of the identifiers, though.
Here is an example:

Identifiers:

{PG, PGN, T, TN, Axis}

Filenames:

Measurement_2020-08-10 13.08.04.578_Batch counter_41.0_PGN_1338_TN_1337
Measurement_2020-08-10 13.05.15.065_Batch counter_39.0_PG_under_score_program_name_T_1337
Measurement_2020-08-10 13.05.15.065_Batch counter_39.0_Axis_unsolvable_PG_T_bla

Expected results:

{[PGN,1338],[TN,1337]}
{[PG,under_score_program_name],[T,1337]}
Ambiguous, with two possible solutions: {[Axis,unsolvable_PG],[T,bla]} or {[Axis,unsolvable],[PG,T_bla]}

As you can see, I constructed some of these to test certain problematic values, especially the last one, where an identifier is actually used as part of a value.

Obviously there must be a way to solve this, since I can look at a filename and figure it out quite quickly, but I just can't come up with a way to parse it correctly.

Added the regex tag because it could be possible to solve this with one.

Thank you in advance for any suggestions.
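
To illustrate the regex idea mentioned above, here is a minimal sketch of what such an attempt might look like. The class name RegexSketch and the exact pattern are assumptions made for illustration, not something from the original post. The pattern looks for an identifier between underscores and lazily takes everything up to the next "_identifier_" (or the end of the name) as its value. Note that it silently picks one reading of the ambiguous third filename and never reports the ambiguity.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
public class RegexSketch {

    static Map<String, String> parse(String filename, List<String> identifiers) {
        // Quote each identifier so that regex metacharacters in it are taken literally.
        String keys = identifiers.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));

        // key: an identifier enclosed by underscores (or at the very start);
        // value: the shortest text followed by "_identifier_" or the end of the name.
        Pattern pattern = Pattern.compile(
                "(?:^|_)(" + keys + ")_(.*?)(?=_(?:" + keys + ")_|$)");

        Map<String, String> result = new LinkedHashMap<>();
        Matcher matcher = pattern.matcher(filename);
        while (matcher.find()) {
            result.put(matcher.group(1), matcher.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> identifiers = Arrays.asList("PG", "PGN", "T", "TN", "Axis");
        System.out.println(parse(
                "Measurement_2020-08-10 13.05.15.065_Batch counter_39.0_Axis_unsolvable_PG_T_bla",
                identifiers));
        // prints {Axis=unsolvable, PG=T_bla}
    }
}
```

Because the value is matched lazily, an identifier is treated as a key as early as possible, so the other reading ({Axis=unsolvable_PG, T=bla}) is never produced.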

Answer 1

Score: 0

A plain old StringTokenizer will do:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

public class FilenameParser {

    private final Set<String> keywords;

    public FilenameParser(Set<String> keywords) {
        this.keywords = keywords;
    }

    public Map<String, String> parse(String filename) {
        Map<String, String> results = new HashMap<>();

        StringTokenizer tokenizer = new StringTokenizer(filename, "_");
        while (tokenizer.hasMoreTokens()) {
            // Tokens that come before the first keyword are ignored by nextToken.
            String token = tokenizer.nextToken();
            nextToken(token, tokenizer, results);
        }

        return results;
    }

    // If the token is a keyword, collect the following tokens as its value
    // until the next keyword shows up, then handle that keyword recursively.
    private void nextToken(String token, StringTokenizer tokenizer, Map<String, String> results) {
        if (keywords.contains(token)) {
            boolean keywordFound = false;
            String key = token;
            List<String> value = new ArrayList<>();
            while (tokenizer.hasMoreTokens() && !keywordFound) {
                token = tokenizer.nextToken();
                if (keywords.contains(token)) {
                    // Every keyword token starts a new entry, even directly after another keyword.
                    keywordFound = true;
                    nextToken(token, tokenizer, results);
                } else {
                    value.add(token);
                }
            }

            // Re-join the value tokens with the separator; an empty list gives "".
            results.put(key, String.join("_", value));
        }
    }
}

Usage:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Starter {

    public static void main(String[] args) {

        Set<String> keywords = new HashSet<>(Arrays.asList("PG", "PGN", "T", "TN", "Axis"));

        FilenameParser parser = new FilenameParser(keywords);

        Map<String, String> result = parser.parse("Measurement_2020-08-10 13.05.15.065_Batch counter_39.0_Axis_unsolvable_PG_T_bla");
        System.out.println(result);
    }
}

Result: {T=bla, PG=, Axis=unsolvable}
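
The result above treats every identifier token as a key, so PG ends up with an empty value and the ambiguity from the question goes unnoticed. If ambiguous names should be detected instead, one option is to enumerate the possible readings. The following is only a sketch under assumptions that are not stated in the question: values must not be empty, the first identifier token always starts the key/value list, each identifier occurs at most once as a key, and a reading is only kept if it uses as many identifiers as keys as possible. The class name AmbiguityCheck and the method candidates are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only.
public class AmbiguityCheck {

    // Returns every reading that uses the maximum possible number of keys.
    // More than one entry in the result means the filename is ambiguous.
    static List<Map<String, String>> candidates(String filename, Set<String> identifiers) {
        String[] tokens = filename.split("_");

        // Assumption: the first token that is an identifier starts the key/value list.
        int first = 0;
        while (first < tokens.length && !identifiers.contains(tokens[first])) {
            first++;
        }

        List<Map<String, String>> all = new ArrayList<>();
        if (first < tokens.length) {
            collect(tokens, first, identifiers, new LinkedHashMap<>(), all);
        }

        // Assumption: prefer readings in which as many identifiers as possible act as keys.
        int max = all.stream().mapToInt(Map::size).max().orElse(0);
        all.removeIf(reading -> reading.size() < max);
        return all;
    }

    // tokens[keyIndex] is taken as a key; try every non-empty value that may follow it.
    private static void collect(String[] tokens, int keyIndex, Set<String> identifiers,
                                Map<String, String> soFar, List<Map<String, String>> out) {
        for (int end = keyIndex + 2; end <= tokens.length; end++) {
            // A value may only stop at the end of the name or right before an identifier.
            if (end < tokens.length && !identifiers.contains(tokens[end])) {
                continue;
            }
            Map<String, String> next = new LinkedHashMap<>(soFar);
            next.put(tokens[keyIndex],
                    String.join("_", Arrays.copyOfRange(tokens, keyIndex + 1, end)));
            if (end == tokens.length) {
                out.add(next);
            } else {
                collect(tokens, end, identifiers, next, out);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> identifiers = new HashSet<>(Arrays.asList("PG", "PGN", "T", "TN", "Axis"));
        List<Map<String, String>> readings = candidates(
                "Measurement_2020-08-10 13.05.15.065_Batch counter_39.0_Axis_unsolvable_PG_T_bla",
                identifiers);
        System.out.println(readings.size() > 1 ? "ambiguous: " + readings : "unique: " + readings);
        // prints ambiguous: [{Axis=unsolvable, PG=T_bla}, {Axis=unsolvable_PG, T=bla}]
    }
}
```

With these assumptions the first two filenames from the question each yield exactly one reading, and the third yields the two readings listed there. Without the "as many keys as possible" rule, readings such as {Axis=unsolvable_PG_T_bla} would also qualify, so that rule is a design choice rather than something the filename format guarantees.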

huangapple
  • Published on 2020-08-14 19:11:12
  • When reposting, please keep the link to this article: https://java.coder-hub.com/63411645.html