+.. _bered_ctx:
+
+BER encoding
+------------
+
+By default PyDERASN accepts only DER encoded data. By default it encodes
+to DER. But you can optionally enable BER decoding with setting
+``bered`` :ref:`context <ctx>` argument to True. Indefinite lengths and
+constructed primitive types should be parsed successfully.
+
+* If object is encoded in BER form (not the DER one), then ``ber_encoded``
+ attribute is set to True. Only ``BOOLEAN``, ``BIT STRING``, ``OCTET
+ STRING``, ``OBJECT IDENTIFIER``, ``SEQUENCE``, ``SET``, ``SET OF``,
+ ``UTCTime``, ``GeneralizedTime`` can contain it.
+* If object has an indefinite length encoding, then its ``lenindef``
+ attribute is set to True. Only ``BIT STRING``, ``OCTET STRING``,
+ ``SEQUENCE``, ``SET``, ``SEQUENCE OF``, ``SET OF``, ``ANY`` can
+ contain it.
+* If object has an indefinite length encoded explicit tag, then
+ ``expl_lenindef`` is set to True.
+* If object has either any of BER-related encoding (explicit tag
+ indefinite length, object's indefinite length, BER-encoding) or any
+ underlying component has that kind of encoding, then ``bered``
+ attribute is set to True. For example SignedData CMS can have
+ ``ContentInfo:content:signerInfos:*`` ``bered`` value set to True, but
+ ``ContentInfo:content:signerInfos:*:signedAttrs`` won't.
+
+EOC (end-of-contents) token's length is taken in advance in object's
+value length.
+
+.. _allow_expl_oob_ctx:
+
+Allow explicit tag out-of-bound
+-------------------------------
+
+Invalid BER encoding could contain ``EXPLICIT`` tag containing more than
+one value, more than one object. If you set ``allow_expl_oob`` context
+option to True, then no error will be raised and that invalid encoding
+will be silently further processed. But pay attention that offsets and
+lengths will be invalid in that case.
+
+.. warning::
+
+ This option should be used only for skipping some decode errors, just
+ to see the decoded structure somehow.
+
+.. _streaming:
+
+Streaming and dealing with huge structures
+------------------------------------------
+
+.. _evgen_mode:
+
+evgen mode
+__________
+
+ASN.1 structures can be huge, they can hold millions of objects inside
+(for example Certificate Revocation Lists (CRL), holding revocation
+state for every previously issued X.509 certificate). CACert.org's 8 MiB
+CRL file takes more than half a gigabyte of memory to hold the decoded
+structure.
+
+If you just simply want to check the signature over the ``tbsCertList``,
+you can create specialized schema with that field represented as
+OctetString for example::
+
+ class TBSCertListFast(Sequence):
+ schema = (
+ [...]
+ ("revokedCertificates", OctetString(
+ impl=SequenceOf.tag_default,
+ optional=True,
+ )),
+ [...]
+ )
+
+This allows you to quickly decode a few fields and check the signature
+over the ``tbsCertList`` bytes.
+
+But how can you get all certificate's serial number from it, after you
+trust that CRL after signature validation? You can use so called
+``evgen`` (event generation) mode, to catch the events/facts of some
+successful object decoding. Let's use command line capabilities::
+
+ $ python -m pyderasn --schema tests.test_crl:CertificateList --evgen revoke.crl
+ 10 [1,1, 1] . . version: Version INTEGER v2 (01) OPTIONAL
+ 15 [1,1, 9] . . . algorithm: OBJECT IDENTIFIER 1.2.840.113549.1.1.13
+ 26 [0,0, 2] . . . parameters: [UNIV 5] ANY OPTIONAL
+ 13 [1,1, 13] . . signature: AlgorithmIdentifier SEQUENCE
+ 34 [1,1, 3] . . . . . . type: AttributeType OBJECT IDENTIFIER 2.5.4.10
+ 39 [0,0, 9] . . . . . . value: [UNIV 19] AttributeValue ANY
+ 32 [1,1, 14] . . . . . 0: AttributeTypeAndValue SEQUENCE
+ 30 [1,1, 16] . . . . 0: RelativeDistinguishedName SET OF
+ [...]
+ 188 [1,1, 1] . . . . userCertificate: CertificateSerialNumber INTEGER 17 (11)
+ 191 [1,1, 13] . . . . . utcTime: UTCTime UTCTime 2003-04-01T14:25:08
+ 191 [0,0, 15] . . . . revocationDate: Time CHOICE utcTime
+ 191 [1,1, 13] . . . . . utcTime: UTCTime UTCTime 2003-04-01T14:25:08
+ 186 [1,1, 18] . . . 0: RevokedCertificate SEQUENCE
+ 208 [1,1, 1] . . . . userCertificate: CertificateSerialNumber INTEGER 20 (14)
+ 211 [1,1, 13] . . . . . utcTime: UTCTime UTCTime 2002-10-01T02:18:01
+ 211 [0,0, 15] . . . . revocationDate: Time CHOICE utcTime
+ 211 [1,1, 13] . . . . . utcTime: UTCTime UTCTime 2002-10-01T02:18:01
+ 206 [1,1, 18] . . . 1: RevokedCertificate SEQUENCE
+ [...]
+ 9144992 [0,0, 15] . . . . revocationDate: Time CHOICE utcTime
+ 9144992 [1,1, 13] . . . . . utcTime: UTCTime UTCTime 2020-02-08T07:25:06
+ 9144985 [1,1, 20] . . . 415755: RevokedCertificate SEQUENCE
+ 181 [1,4,9144821] . . revokedCertificates: RevokedCertificates SEQUENCE OF OPTIONAL
+ 5 [1,4,9144997] . tbsCertList: TBSCertList SEQUENCE
+ 9145009 [1,1, 9] . . algorithm: OBJECT IDENTIFIER 1.2.840.113549.1.1.13
+ 9145020 [0,0, 2] . . parameters: [UNIV 5] ANY OPTIONAL
+ 9145007 [1,1, 13] . signatureAlgorithm: AlgorithmIdentifier SEQUENCE
+ 9145022 [1,3, 513] . signatureValue: BIT STRING 4096 bits
+ 0 [1,4,9145534] CertificateList SEQUENCE
+
+Here we see how decoder works: it decodes SEQUENCE's tag, length, then
+decodes underlying values. It can not tell if SEQUENCE is decoded, so
+the event of the upper level SEQUENCE is the last one we see.
+``version`` field is just a single INTEGER -- it is decoded and event is
+fired immediately. Then we see that ``algorithm`` and ``parameters``
+fields are decoded and only after them the ``signature`` SEQUENCE is
+fired as a successfully decoded. There are 4 events for each revoked
+certificate entry in that CRL: ``userCertificate`` serial number,
+``utcTime`` of ``revocationDate`` CHOICE, ``RevokedCertificate`` itself
+as a one of entity in ``revokedCertificates`` SEQUENCE OF.
+
+We can do that in our ordinary Python code and understand where we are
+by looking at deterministically generated decode paths (do not forget
+about useful ``--print-decode-path`` CLI option). We must use
+:py:meth:`pyderasn.Obj.decode_evgen` method, instead of ordinary
+:py:meth:`pyderasn.Obj.decode`. It is generator yielding ``(decode_path,
+obj, tail)`` tuples::
+
+ for decode_path, obj, _ in CertificateList().decode_evgen(crl_raw):
+ if (
+ len(decode_path) == 4 and
+ decode_path[:2] == ("tbsCertList", "revokedCertificates"),
+ decode_path[3] == "userCertificate"
+ ):
+ print("serial number:", int(obj))
+
+Virtually it does not take any memory except at least needed for single
+object storage. You can easily use that mode to determine required
+object ``.offset`` and ``.*len`` to be able to decode it separately, or
+maybe verify signature upon it just by taking bytes by ``.offset`` and
+``.tlvlen``.
+
+.. _evgen_mode_upto_ctx:
+
+evgen_mode_upto
+_______________
+
+There is full ability to get any kind of data from the CRL in the
+example above. However it is not too convenient to get the whole
+``RevokedCertificate`` structure, that is pretty lightweight and one may
+do not want to disassemble it. You can use ``evgen_mode_upto``
+:ref:`ctx <ctx>` option that semantically equals to
+:ref:`defines_by_path <defines_by_path_ctx>` -- list of decode paths
+mapped to any non-None value. If specified decode path is met, then any
+subsequent objects won't be decoded in evgen mode. That allows us to
+parse the CRL above with fully assembled ``RevokedCertificate``::
+
+ for decode_path, obj, _ in CertificateList().decode_evgen(
+ crl_raw,
+ ctx={"evgen_mode_upto": (
+ (("tbsCertList", "revokedCertificates", any), True),
+ )},
+ ):
+ if (
+ len(decode_path) == 3 and
+ decode_path[:2] == ("tbsCertList", "revokedCertificates"),
+ ):
+ print("serial number:", int(obj["userCertificate"]))
+
+.. note::
+
+ SEQUENCE/SET values with DEFAULT specified are automatically decoded
+ without evgen mode.
+
+.. _mmap:
+
+mmap-ed file
+____________
+
+POSIX compliant systems have ``mmap`` syscall, giving ability to work
+the memory mapped file. You can deal with the file like it was an
+ordinary binary string, allowing you not to load it to the memory first.
+Also you can use them as an input for OCTET STRING, taking no Python
+memory for their storage.
+
+There is convenient :py:func:`pyderasn.file_mmaped` function that
+creates read-only memoryview on the file contents::
+
+ with open("huge", "rb") as fd:
+ raw = file_mmaped(fd)
+ obj = Something.decode(raw)
+
+.. warning::
+
+ mmap maps the **whole** file. So it plays no role if you seek-ed it
+ before. Take the slice of the resulting memoryview with required
+ offset instead.
+
+.. note::
+
+ If you use ZFS as underlying storage, then pay attention that
+ currently most platforms does not deal good with ZFS ARC and ordinary
+ page cache used for mmaps. It can take twice the necessary size in
+ the memory: both in page cache and ZFS ARC.
+
+CER encoding
+____________
+
+We can parse any kind of data now, but how can we produce files
+streamingly, without storing their encoded representation in memory?
+SEQUENCE by default encodes in memory all its values, joins them in huge
+binary string, just to know the exact size of SEQUENCE's value for
+encoding it in TLV. DER requires you to know all exact sizes of the
+objects.
+
+You can use CER encoding mode, that slightly differs from the DER, but
+does not require exact sizes knowledge, allowing streaming encoding
+directly to some writer/buffer. Just use
+:py:meth:`pyderasn.Obj.encode_cer` method, providing the writer where
+encoded data will flow::
+
+ with open("result", "wb") as fd:
+ obj.encode_cer(fd.write)
+
+::
+
+ buf = io.BytesIO()
+ obj.encode_cer(buf.write)
+
+If you do not want to create in-memory buffer every time, then you can
+use :py:func:`pyderasn.encode_cer` function::
+
+ data = encode_cer(obj)
+
+Remember that CER is **not valid** DER in most cases, so you **have to**
+use :ref:`bered <bered_ctx>` :ref:`ctx <ctx>` option during its
+decoding. Also currently there is **no** validation that provided CER is
+valid one -- you are sure that it has only valid BER encoding.
+
+.. warning::
+
+ SET OF values can not be streamingly encoded, because they are
+ required to be sorted byte-by-byte. Big SET OF values still will take
+ much memory. Use neither SET nor SET OF values, as modern ASN.1
+ also recommends too.
+
+Do not forget about using :ref:`mmap-ed <mmap>` memoryviews for your
+OCTET STRINGs! They will be streamingly copied from underlying file to
+the buffer using 1 KB chunks.
+
+Some structures require that some of the elements have to be forcefully
+DER encoded. For example ``SignedData`` CMS requires you to encode
+``SignedAttributes`` and X.509 certificates in DER form, allowing you to
+encode everything else in BER. You can tell any of the structures to be
+forcefully encoded in DER during CER encoding, by specifying
+``der_forced=True`` attribute::
+
+ class Certificate(Sequence):
+ schema = (...)
+ der_forced = True
+
+ class SignedAttributes(SetOf):
+ schema = Attribute()
+ bounds = (1, float("+inf"))
+ der_forced = True
+
+.. _agg_octet_string:
+
+agg_octet_string
+________________
+
+In most cases, huge quantity of binary data is stored as OCTET STRING.
+CER encoding splits it on 1 KB chunks. BER allows splitting on various
+levels of chunks inclusion::
+
+ SOME STRING[CONSTRUCTED]
+ OCTET STRING[CONSTRUCTED]
+ OCTET STRING[PRIMITIVE]
+ DATA CHUNK
+ OCTET STRING[PRIMITIVE]
+ DATA CHUNK
+ OCTET STRING[PRIMITIVE]
+ DATA CHUNK
+ OCTET STRING[PRIMITIVE]
+ DATA CHUNK
+ OCTET STRING[CONSTRUCTED]
+ OCTET STRING[PRIMITIVE]
+ DATA CHUNK
+ OCTET STRING[PRIMITIVE]
+ DATA CHUNK
+ OCTET STRING[CONSTRUCTED]
+ OCTET STRING[CONSTRUCTED]
+ OCTET STRING[PRIMITIVE]
+ DATA CHUNK
+
+You can not just take the offset and some ``.vlen`` of the STRING and
+treat it as the payload. If you decode it without
+:ref:`evgen mode <evgen_mode>`, then it will be automatically aggregated
+and ``bytes()`` will give the whole payload contents.
+
+You are forced to use :ref:`evgen mode <evgen_mode>` for decoding for
+small memory footprint. There is convenient
+:py:func:`pyderasn.agg_octet_string` helper for reconstructing the
+payload. Let's assume you have got BER/CER encoded ``ContentInfo`` with
+huge ``SignedData`` and ``EncapsulatedContentInfo``. Let's calculate the
+SHA512 digest of its ``eContent``::
+
+ fd = open("data.p7m", "rb")
+ raw = file_mmaped(fd)
+ ctx = {"bered": True}
+ for decode_path, obj, _ in ContentInfo().decode_evgen(raw, ctx=ctx):
+ if decode_path == ("content",):
+ content = obj
+ break
+ else:
+ raise ValueError("no content found")
+ hasher_state = sha512()
+ def hasher(data):
+ hasher_state.update(data)
+ return len(data)
+ evgens = SignedData().decode_evgen(
+ raw[content.offset:],
+ offset=content.offset,
+ ctx=ctx,
+ )
+ agg_octet_string(evgens, ("encapContentInfo", "eContent"), raw, hasher)
+ fd.close()
+ digest = hasher_state.digest()
+
+Simply replace ``hasher`` with some writeable file's ``fd.write`` to
+copy the payload (without BER/CER encoding interleaved overhead) in it.
+Virtually it won't take memory more than for keeping small structures
+and 1 KB binary chunks.
+
+.. _seqof-iterators:
+
+SEQUENCE OF iterators
+_____________________
+
+You can use iterators as a value in :py:class:`pyderasn.SequenceOf`
+classes. The only difference with providing the full list of objects, is
+that type and bounds checking is done during encoding process. Also
+sequence's value will be emptied after encoding, forcing you to set its
+value again.
+
+This is very useful when you have to create some huge objects, like
+CRLs, with thousands and millions of entities inside. You can write the
+generator taking necessary data from the database and giving the
+``RevokedCertificate`` objects. Only binary representation of that
+objects will take memory during DER encoding.
+
+2-pass DER encoding
+-------------------
+
+There is ability to do 2-pass encoding to DER, writing results directly
+to specified writer (buffer, file, whatever). It could be 1.5+ times
+slower than ordinary encoding, but it takes little memory for 1st pass
+state storing. For example, 1st pass state for CACert.org's CRL with
+~416K of certificate entries takes nearly 3.5 MB of memory.
+``SignedData`` with several gigabyte ``EncapsulatedContentInfo`` takes
+nearly 0.5 KB of memory.
+
+If you use :ref:`mmap-ed <mmap>` memoryviews, :ref:`SEQUENCE OF
+iterators <seqof-iterators>` and write directly to opened file, then
+there is very small memory footprint.
+
+1st pass traverses through all the objects of the structure and returns
+the size of DER encoded structure, together with 1st pass state object.
+That state contains precalculated lengths for various objects inside the
+structure.
+
+::
+
+ fulllen, state = obj.encode1st()
+
+2nd pass takes the writer and 1st pass state. It traverses through all
+the objects again, but writes their encoded representation to the writer.
+
+::
+
+ with open("result", "wb") as fd:
+ obj.encode2nd(fd.write, iter(state))
+
+.. warning::
+
+ You **MUST NOT** use 1st pass state if anything is changed in the
+ objects. It is intended to be used immediately after 1st pass is
+ done!
+
+If you use :ref:`SEQUENCE OF iterators <seqof-iterators>`, then you
+have to reinitialize the values after the 1st pass. And you **have to**
+be sure that the iterator gives exactly the same values as previously.
+Yes, you have to run your iterator twice -- because this is two pass
+encoding mode.
+
+If you want to encode to the memory, then you can use convenient
+:py:func:`pyderasn.encode2pass` helper.
+
+.. _browser:
+
+ASN.1 browser
+-------------
+.. autofunction:: pyderasn.browse
+
+Base Obj
+--------
+.. autoclass:: pyderasn.Obj
+ :members:
+