Java的BitSet为什么在内部存储长整型数组,但在set方法中使用整型?

huangapple 未分类评论53阅读模式
标题翻译

Why does Java's BitSet internally store long array but use int for set method?

问题

根据BitSet的实现,它在内部使用一个long型数组:

/**
 * 与serialField“bits”对应的内部字段。
 */
private long[] words;

但是对于set方法,它使用int类型:

public void set(int bitIndex) {...}

因此基本上我们可以存储(2^31 - 1) * 64 * 8 = 2,147,483,642 * 64 * 8 = 137,438,953,088个位,但是使用int索引,我们只能访问前2,147,483,648个位。

这意味着有137,438,953,088 - 2,147,483,648 = 135,291,469,440个位是不可用的。

但是,如果这个类的开发者在位索引中使用了long而不是int,那么所有的问题都将得到解决,因为使用long,我们可以遍历2^63 - 1 = 9,223,372,036,854,775,807个位。

即使从性能的角度来看,这也没有任何意义。

使用int而不是long进行索引和丢失数十亿个位的逻辑背后的推理是什么?

附注:有人可能会说问题是2 GiB的堆大小,但是如今这不再是一个问题。

英文翻译

According to the BitSet implementation, it internally uses an array of longs:

/**
 * The internal field corresponding to the serialField "bits".
 */
private long[] words;

But for the set method it uses int:

public void set(int bitIndex) {...}

So basically we can store (2^31 - 1) * 64 * 8 = 2,147,483,642 * 64 * 8 = 137,438,953,088 bits, but using int indexing we have access only to the first 2,147,483,648 bits.

Which means that 137,438,953,088 - 2,147,483,648 = 135,291,469,440 bits are unavailable.

But if developers of this class used long instead of int for bits indexing, it would solve all the problems, since with long we can navigate trough 2^63 - 1 = 9,223,372,036,854,775,807 bits

It does not make any sense even from performance point of view.

What the reasoning behind the logic of using int instead of long for indexing and missing billions of bits?

P.S. One can say that the problem is 2 GiB of heap size, but today it is not an issue anymore.

答案1

得分: 0

java.util.BitSet 的文档说明:

BitSet 的位是由非负整数索引的。

这是它的预期行为,所以不需要使用 long 索引。

关于它的内部数据结构是否支持超过 2^31 个个别位,这是一个与类的公共接口无关的实现细节(它们本可以使用 boolean[] 数组,尽管会占用更多内存和某些方法运行时更多时间,但类仍然可以正常工作)。

问题仍然存在:这个类的公共接口是否会更改以支持 long 索引?

这是极不可能的,因为支持 long 索引意味着方法如下也需要更改:

  • int cardinality()
  • int nextClearBit()(以及类似的方法:下一个/上一个清除/设置位)
  • int size()
  • IntStream stream()

这将破坏现有的代码。

我唯一能想到的具有 long 索引的类似于 BitSet 的类是一个额外的类 BigBitSet(或 LongBitSet 或您喜欢的任何名称),这样需要超过 2^31 位的人可以切换到新类。

是否会将这样的类添加到 java.util 包中是另一个问题 - 为此,您需要说服 JCP 执行委员会,这是当前 Java 生态系统中的一个重要补充/重大缺陷。

英文翻译

The documentation of java.util.BitSet states:

> The bits of a BitSet are indexed by nonnegative integers.

This is what it is supposed to do, so no long indexes needed.

That it's internal data structure could support more than 2^31 individual bits is an implementation detail that has no relevance for the public interface of the class (they could have used a boolean[] array and the class would still work, albeit with a bigger memory footprint and more runtime for some methods.)


The question remains: will the public interface of this class change to support long indexes?

This is highly unlikely, because supporting long indexes would mean that methods like

  • int cardinality()
  • int nextClearBit() (and similar methods: next/previous clear/set bit)
  • int size()
  • IntStream stream()

would also need to be changed, which would break existing code.

The only way I can think of a BitSet like class with long indexes would be an additional class BigBitSet (or LongBitSet or whatever you like) so that people needing bitsets with more then 2^31 bits could switch to that new class.

Whether such a class would ever be added to the java.util package is another question - for that you would have to convince the JCP executive board that this is a important addition / gaping hole in the current Java ecosystem.

答案2

得分: -1

每个64位的块被打包成长整型,而不是每个位索引一个长整型,因此在调用set(2147483647)时,长整型数组long[]的长度将使用高达268,435,456字节,如果只调用bitset.set(1),则只使用一个长整型。示例在jshell中:

BitSet b = new BitSet();
b.size();
==> 64即words的长度为1可以存储64位
b.set(1);
b.size();
==> 64即words仍然是长度1
b.set(64);
==> 128即words数组的长度为2可以存储高达128位
英文翻译

Each chunk of 64 bits is packed into long, not one long per bit index, so length of the long[] words array will use up to 268,435,456 bytes with int index when calling set(2147483647) or just one long if calling only bitset.set(1). Example in jshell:

BitSet b = new BitSet();
b.size();
==> 64 (ie words is length 1 can store 64 bits)
b.set(1);
b.size();
==> 64 (ie words is still length 1)
b.set(64)
==> 128 (ie words array is length 2, can store up to 128 bits)

答案3

得分: -2

通常,您使用位集来索引其他内容。假设您使用这个位集来索引一个数组。

BitSet b = new BitSet();
b.set(2147483647);
ArrayList<X> items = new ArrayList<X>();
// ...向ArrayList添加大量元素...
// 然后:
X item = items.get(b.nextSetBit(0));

为了使这个工作正常,数组列表必须包含 2,147,483,648 个元素,至少需要使用 2GB 的内存(假设每个元素至少需要 1 字节的存储),这可能会导致 Java 崩溃。

英文翻译

Usually you use bit sets to index into something else. Let’s say you use this bitset to index into an array.

BitSet b = new BitSet();
b.set(2147483647);
ArrayList&lt;X&gt; items = new ArrayList&lt;X&gt;();
// ...add a looot of elements to the ArrayList...
// then:
X item = items.get(b.nextSetBit(0));

To make this work, the array list must contain 2,147,483,648 elements, and it would at least use 2GB of RAM (assuming each element requires at least 1 byte of storage), which would crash Java.

huangapple
  • 本文由 发表于 2020年5月30日 19:02:52
  • 转载请务必保留本文链接:https://java.coder-hub.com/62101485.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定