
Table of Contents
1 Unicode
2 Unicode in Python
2.1 Benefits of Unicode objects
2.2 Python’s optimization of Unicode
3 Underlying structures of Unicode objects
3.1 PyASCIIObject
3.2 PyCompactUnicodeObject
3.3 PyUnicodeObject
3.4 Example
4 The interned mechanism

Python built-in type str source code analysis

May 09, 2023 pm 02:16 PM
python str

1 Unicode

The basic unit of computer storage is the byte, which consists of 8 bits. Because English uses only 26 letters plus a handful of symbols, each English character fits in a single byte. Other languages (such as Chinese, Japanese, and Korean) have far more characters and must use multiple bytes per character.

As computing spread, encodings for non-Latin scripts kept evolving, but two major limitations remained:

  • No multilingual support: the encoding scheme of one language cannot be used for another language

  • No unified standard: Chinese alone, for example, has several encoding standards, such as GBK, GB2312, and GB18030

Because encodings were not unified, developers had to convert back and forth between them, which inevitably led to errors. The Unicode standard was proposed to solve this inconsistency. Unicode collects and encodes most of the world's writing systems so that computers can process text in a uniform way. It currently contains more than 140,000 characters and supports multiple languages by design. (The "uni" in Unicode comes from "unification".)
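To make the difference between a character and its encoded bytes concrete, here is a small sketch. The sample character is an arbitrary choice for illustration; the point is only that one code point maps to different byte sequences under GBK and UTF-8.

```python
# One character has a single Unicode code point but different bytes
# under each encoding scheme.
text = "中"

code_point = ord(text)              # Unicode code point, independent of any encoding
gbk_bytes = text.encode("gbk")      # a legacy Chinese encoding
utf8_bytes = text.encode("utf-8")   # a Unicode encoding

print(f"U+{code_point:04X}")        # U+4E2D
print(gbk_bytes.hex(), utf8_bytes.hex())

# Decoding bytes with the wrong scheme either fails or yields wrong text.
try:
    print(utf8_bytes.decode("gbk"))
except UnicodeDecodeError as exc:
    print("cannot decode as GBK:", exc.reason)
```

Mixing up the two byte forms is exactly the kind of error that a single shared character set is meant to prevent.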

2 Unicode in Python

2.1 Benefits of Unicode objects

Since Python 3, the str object is represented internally with Unicode, which is why it is called a Unicode object in the source code. The advantage is that the core logic of a program works uniformly on Unicode, and decoding and encoding are needed only at the input and output layers, which avoids most encoding problems.

[Figure: bytes are decoded at the input layer, handled as Unicode in the core logic, and encoded at the output layer]
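The decode-at-input, encode-at-output pattern can be sketched as follows. The function name `shout` and the sample text are hypothetical and chosen only for illustration.

```python
def shout(raw: bytes, src_encoding: str, dst_encoding: str) -> bytes:
    """Decode at the input edge, work on str, encode at the output edge."""
    text = raw.decode(src_encoding)       # input layer: bytes -> str
    result = text.upper() + "!"           # core logic: pure Unicode, encoding-agnostic
    return result.encode(dst_encoding)    # output layer: str -> bytes

gbk_input = "python 字符串".encode("gbk")
utf8_output = shout(gbk_input, "gbk", "utf-8")
print(utf8_output.decode("utf-8"))
```

The core logic never needs to know which encoding the input used; only the two edges do.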

2.2 Python’s optimization of Unicode

Problem: Unicode contains more than 140,000 characters, so a fixed-width representation needs 4 bytes per character (2 bytes are not enough, and 3-byte units are generally not used). ASCII, by contrast, needs only 1 byte for an English character, so storing everything in fixed-width Unicode would quadruple the cost of the most frequently used English characters.

First of all, let’s take a look at the size difference of different forms of str objects in Python:

>>> import sys
>>> sys.getsizeof('ab') - sys.getsizeof('a')
1
>>> sys.getsizeof('一二') - sys.getsizeof('一')
2
>>> sys.getsizeof('\U0001F600\U0001F600') - sys.getsizeof('\U0001F600')  # a character above U+FFFF
4

As the results show, Python optimizes Unicode objects internally: it chooses the underlying storage unit according to the text content.
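The same measurement can be wrapped in a small helper. This is a sketch that assumes CPython, where `sys.getsizeof` reports the full size of a string object; `bytes_per_char` is a hypothetical name.

```python
import sys

def bytes_per_char(ch: str) -> int:
    """Return how many bytes one more copy of `ch` adds to a string of `ch`."""
    # Compare two multi-character strings so that any cached
    # single-character objects do not skew the measurement.
    return sys.getsizeof(ch * 3) - sys.getsizeof(ch * 2)

for ch in ("a", "\u00e9", "\u4e00", "\U0001F600"):
    print(f"U+{ord(ch):06X}: {bytes_per_char(ch)} byte(s) per character")
```

The four samples cover ASCII, Latin-1, a character from the Basic Multilingual Plane, and a character above U+FFFF.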

The underlying storage of a Unicode object falls into three categories, according to the code point range of the characters in the text:

  • PyUnicode_1BYTE_KIND: all code points are between U+0000 and U+00FF

  • PyUnicode_2BYTE_KIND: all code points are between U+0000 and U+FFFF, and at least one is greater than U+00FF

  • PyUnicode_4BYTE_KIND: all code points are between U+0000 and U+10FFFF, and at least one is greater than U+FFFF

The corresponding enumeration is as follows:
enum PyUnicode_Kind {
/* String contains only wstr byte characters.  This is only possible
   when the string was created with a legacy API and _PyUnicode_Ready()
   has not been called yet.  */
    PyUnicode_WCHAR_KIND = 0,
/* Return values of the PyUnicode_KIND() macro: */
    PyUnicode_1BYTE_KIND = 1,
    PyUnicode_2BYTE_KIND = 2,
    PyUnicode_4BYTE_KIND = 4
};

A different storage unit is selected for each category:

/* Py_UCS4 and Py_UCS2 are typedefs for the respective
   unicode representations. */
typedef uint32_t Py_UCS4;
typedef uint16_t Py_UCS2;
typedef uint8_t Py_UCS1;

The correspondence is as follows:

Text type | Character storage unit | Character storage unit size (bytes)
PyUnicode_1BYTE_KIND | Py_UCS1 | 1
PyUnicode_2BYTE_KIND | Py_UCS2 | 2
PyUnicode_4BYTE_KIND | Py_UCS4 | 4

Since the internal storage structure of a Unicode object varies with the text type, the kind must be saved as a common field of the object. Python internally defines several flag bits as common fields of Unicode objects. (Not all of these fields are covered in the rest of this article; the others are left for the reader to explore.)

  • interned: whether the object is managed by the interned mechanism
  • kind: the text type, which determines the size of the underlying character storage unit
  • compact: the memory allocation method, that is, whether the object and the text buffer are allocated separately
  • ascii: whether the text is pure ASCII

The PyUnicode_New function initializes a Unicode object from the number of characters, size, and the largest character, maxchar. Based on maxchar, it selects the most compact character storage unit and underlying structure for the object. (The source code is fairly long, so it is not listed here; the result is summarized in the table below.)

maxchar | maxchar < 128 | 128 <= maxchar < 256 | 256 <= maxchar < 65536 | 65536 <= maxchar < MAX_UNICODE
kind | PyUnicode_1BYTE_KIND | PyUnicode_1BYTE_KIND | PyUnicode_2BYTE_KIND | PyUnicode_4BYTE_KIND
ascii | 1 | 0 | 0 | 0
Character storage unit size (bytes) | 1 | 1 | 2 | 4
Underlying structure | PyASCIIObject | PyCompactUnicodeObject | PyCompactUnicodeObject | PyCompactUnicodeObject
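The selection by maxchar can be modelled in a few lines of Python. This is a hypothetical sketch of the decision only, not the C implementation of PyUnicode_New; the names `Layout`, `choose_layout`, and `layout_of` are invented for illustration.

```python
from typing import NamedTuple

class Layout(NamedTuple):
    kind: int      # storage unit size in bytes: 1, 2 or 4
    ascii: bool    # value of the ascii flag
    struct: str    # name of the underlying C structure

def choose_layout(maxchar: int) -> Layout:
    """Hypothetical model of the choice made from maxchar."""
    if maxchar < 128:
        return Layout(1, True, "PyASCIIObject")
    if maxchar < 256:
        return Layout(1, False, "PyCompactUnicodeObject")
    if maxchar < 65536:
        return Layout(2, False, "PyCompactUnicodeObject")
    if maxchar <= 0x10FFFF:
        return Layout(4, False, "PyCompactUnicodeObject")
    raise ValueError("maxchar is outside the Unicode range")

def layout_of(text: str) -> Layout:
    # max() needs a default for the empty string, which is treated as ASCII here
    return choose_layout(max(map(ord, text), default=0))

for sample in ("abc", "caf\u00e9", "\u4e00\u4e8c", "\U0001F600"):
    print(repr(sample), layout_of(sample))
```

A single character above the current range is enough to move the whole string to the next, wider layout.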

3 Underlying structures of Unicode objects

3.1 PyASCIIObject

C source:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
        unsigned int :24;
    } state;
    wchar_t *wstr;              /* wchar_t representation (null-terminated) */
} PyASCIIObject;

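Source analysis:

length: the length of the text

hash: the hash value of the text

state: the flag bits of the Unicode object

wstr: a pointer to a cached wchar_t representation of the text, terminated by "\0". It does not hold the primary copy of the text: for an object created through PyUnicode_New, the ASCII text is stored immediately after the PyASCIIObject structure, which is why no further field appears in the structure. (The comment on PyCompactUnicodeObject in section 3.2 says the same about its data.)

The diagram is as follows. (Note the 4-byte hole after the state field. It is caused by the memory alignment of structure fields, whose main purpose is to make memory access more efficient.)

[Figure: memory layout of PyASCIIObject]

The ASCII text follows the structure directly. Take 'abc' and the empty string '' as examples:

[Figure: PyASCIIObject for 'abc']

[Figure: PyASCIIObject for the empty string '']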
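The alignment hole after the state field can be reproduced with ctypes. The class below is a hypothetical stand-in that only mirrors the field order of the struct listed above. It assumes that PyObject_HEAD expands to a reference count plus a type pointer, as on a regular non-debug build, and it does not read the memory of a real str object.

```python
import ctypes

class ASCIIObjectModel(ctypes.Structure):
    """Hypothetical stand-in mirroring the field order of PyASCIIObject."""
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),   # PyObject_HEAD (assumed layout)
        ("ob_type", ctypes.c_void_p),      # PyObject_HEAD (assumed layout)
        ("length", ctypes.c_ssize_t),
        ("hash", ctypes.c_ssize_t),
        ("state", ctypes.c_uint32),        # the 32 flag bits as one integer
        ("wstr", ctypes.c_void_p),
    ]

state = ASCIIObjectModel.state
wstr = ASCIIObjectModel.wstr
# Bytes left unused between the end of state and the start of wstr.
hole = wstr.offset - (state.offset + state.size)

print("state at offset", state.offset, "with size", state.size)
print("wstr at offset", wstr.offset)
print("alignment hole:", hole, "byte(s)")
print("structure size:", ctypes.sizeof(ASCIIObjectModel))
```

On a platform with 8-byte pointers the pointer field must start at a multiple of 8, which is where the 4 unused bytes come from.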

3.2 PyCompactUnicodeObject

If the text is not entirely ASCII, the Unicode object is stored in a PyCompactUnicodeObject structure. The C source is as follows:

/* Non-ASCII strings allocated through PyUnicode_New use the
   PyCompactUnicodeObject structure. state.compact is set, and the data
   immediately follow the structure. */
typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the
                                 * terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
    Py_ssize_t wstr_length;     /* Number of code points in wstr, possible
                                 * surrogates count as two code points. */
} PyCompactUnicodeObject;

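PyCompactUnicodeObject adds three fields to PyASCIIObject:

utf8_length: the length of the text's UTF-8 encoding, in bytes

utf8: the UTF-8 encoded form of the text, cached to avoid repeated encoding

wstr_length: the "length" of wstr. According to the comment in the source, this is the number of code points in wstr, with a possible surrogate pair counting as two code points. (I have not found a way to print it from Python; readers are welcome to investigate.)

Note that PyASCIIObject does not store a UTF-8 form. ASCII text is already valid UTF-8, which is also why ASCII text is stored in a PyASCIIObject.

Structure diagram:

[Figure: memory layout of PyCompactUnicodeObject]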
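The claim that ASCII text is already valid UTF-8 is easy to check from Python. This small sketch uses arbitrary sample strings.

```python
ascii_text = "abc"
mixed_text = "abc\u4e00"   # three ASCII characters plus one CJK character

# For pure ASCII text the UTF-8 bytes are the ASCII bytes themselves,
# so no separate UTF-8 copy is needed.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)

# For other text the UTF-8 length differs from the number of code points,
# which is why a separate utf8_length is worth caching.
print(len(mixed_text), len(mixed_text.encode("utf-8")))
```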

3.3 PyUnicodeObject

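PyUnicodeObject is the concrete implementation of the str object in Python. As the struct shows, it extends PyCompactUnicodeObject with a data pointer to the character buffer. The C source is as follows: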

/* Strings allocated through PyUnicode_FromUnicode(NULL, len) use the
   PyUnicodeObject structure. The actual string data is initially in the wstr
   block, and copied into the data block using _PyUnicode_Ready. */
typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;

3.4 Example

In day-to-day development, keep in mind how much the memory footprint of a string can change when it is concatenated with another:

>>> import sys
>>> text = 'a' * 1000
>>> sys.getsizeof(text)
1049
>>> text += '\U0001F600'  # one character above U+FFFF
>>> sys.getsizeof(text)
4080
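The jump in size follows from a simple formula: a fixed header plus (length + 1) storage units, where the extra unit holds the terminating null. The sketch below assumes CPython and a freshly created string with no cached UTF-8 copy. `predicted_size` is a hypothetical helper, and the header size is measured from a short probe string instead of being hard-coded, because it differs between CPython versions.

```python
import sys

def predicted_size(text: str) -> int:
    """Estimate sys.getsizeof(text) as header + (len + 1) * unit width."""
    maxchar = max(map(ord, text), default=0)
    width = 1 if maxchar < 256 else 2 if maxchar < 65536 else 4
    # A two-character probe with the same maxchar uses the same structure,
    # so its size reveals the header size for this category.
    probe = chr(maxchar) * 2
    header = sys.getsizeof(probe) - (len(probe) + 1) * width
    return header + (len(text) + 1) * width

text = "a" * 1000
wide = text + "\U0001F600"
for s in (text, wide):
    print(len(s), sys.getsizeof(s), predicted_size(s))
```

One appended character above U+FFFF widens every one of the existing 1000 characters from 1 byte to 4, which accounts for almost all of the growth.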

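4 The interned mechanism

If the interned flag of a str object is 1, the Python virtual machine enables the interned mechanism for it.

The source code is as follows. (Many explanations can be found online; I have not yet found a definitive one and will add it later. Still, analyzing the source should give us some insight.)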

/* This dictionary holds all interned unicode strings.  Note that references
   to strings in this dictionary are *not* counted in the string's ob_refcnt.
   When the interned string reaches a refcnt of 0 the string deallocation
   function will delete the reference from this dictionary.
   Another way to look at this is that to say that the actual reference
   count of a string is:  s->ob_refcnt + (s->state ? 2 : 0)
*/
static PyObject *interned = NULL;
void
PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;
    PyObject *t;
#ifdef Py_DEBUG
    assert(s != NULL);
    assert(_PyUnicode_CHECK(s));
#else
    if (s == NULL || !PyUnicode_Check(s))
        return;
#endif
    /* If it's a subclass, we don't really know what putting
       it in the interned dict might do. */
    if (!PyUnicode_CheckExact(s))
        return;
    if (PyUnicode_CHECK_INTERNED(s))
        return;
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    Py_ALLOW_RECURSION
    t = PyDict_SetDefault(interned, s, s);
    Py_END_ALLOW_RECURSION
    if (t == NULL) {
        PyErr_Clear();
        return;
    }
    if (t != s) {
        Py_INCREF(t);
        Py_SETREF(*p, t);
        return;
    }
    /* The two references in interned are not counted by refcnt.
       The deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
}

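As you can see, the function starts with some basic checks. Now look at the call PyDict_SetDefault(interned, s, s) and the statement Py_REFCNT(s) -= 2. When s is added to the interned dictionary, it serves as both the key and the value (I am not sure why it is done this way), so the reference count of s is increased by 2 (see the source of PyDict_SetDefault() for details). The later Py_REFCNT(s) -= 2 subtracts those 2 again so that, as the comment says, the two references held by the dictionary are not counted.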
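The core of the function can be modelled in Python with dict.setdefault. This is a hypothetical sketch of the logic only: `_interned` and `intern_model` are invented names, and reference counting is left out. One plausible reading of the key-and-value design is that looking up an equal string then returns the canonical object directly.

```python
_interned = {}  # hypothetical stand-in for the global `interned` dict

def intern_model(s):
    """Return the canonical object for the text of `s`."""
    if type(s) is not str:
        return s                       # subclasses and non-strings are left alone
    # setdefault stores s as both key and value the first time this text
    # is seen, and returns the previously stored object otherwise.
    return _interned.setdefault(s, s)

a = "".join(["inter", "ned"])          # two equal strings built at run time
b = "".join(["in", "terned"])
print(a == b, a is b)
print(intern_model(a) is a, intern_model(b) is a)
```

After the first call, every equal string passed in is replaced by the object stored first, which mirrors the `t != s` branch in the C code.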

Consider the following scenario:

>>> class User:
...     def __init__(self, name, age):
...         self.name = name
...         self.age = age
...
>>> user = User('Tom', 21)
>>> user.__dict__
{'name': 'Tom', 'age': 21}

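Because the attributes of an object are stored in a dict, every User object would have to hold its own str object 'name', which would waste a lot of memory. Since str is immutable, Python internally turns strings that are likely to be duplicated into singletons; this is the interned mechanism. Concretely, Python maintains a global dict internally, and every str object with the interned mechanism enabled is stored there. When such a string is needed later, a new object is created first; if an equal string is found to be already maintained in the dict, the newly created object is reclaimed.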
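The sharing of attribute names can be observed directly. The sketch below assumes CPython, where attribute names that appear in source code are interned; the variable names are arbitrary.

```python
class User:
    def __init__(self, name, age):
        self.name = name
        self.age = age

tom = User("Tom", 21)
amy = User("Amy", 30)

# Take the first key ('name') from each instance dictionary.
tom_key = next(iter(tom.__dict__))
amy_key = next(iter(amy.__dict__))

# Both instances refer to one shared str object for the attribute name.
print(tom_key, amy_key, tom_key is amy_key)
```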

Example:

'abc' produced by different operations ends up being the same object:

>>> a = 'abc'
>>> b = 'ab' + 'c'
>>> id(a), id(b), a is b
(2752416949872, 2752416949872, True)
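In the session above, 'ab' + 'c' is most likely folded into the constant 'abc' by the compiler before the code runs, so both names start out bound to the same constant. A string built at run time is not interned automatically on CPython, but sys.intern can be called explicitly. The sketch below assumes CPython, and the result of the first identity check is implementation-dependent.

```python
import sys

prefix = "ab"
a = sys.intern("abc")
c = prefix + "c"           # built at run time, so not folded by the compiler

# Equal text, but typically a different object on CPython.
print(a == c, a is c)

# Interning maps the run-time string to the canonical object.
b = sys.intern(c)
print(a is b)
```

Explicit interning is mainly useful for strings that are created dynamically and compared or used as dictionary keys very often.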
