ASCII and Unicode Related Functions in Python
ASCII Code and Related Functions in Python
ASCII (American Standard Code for Information Interchange) is a character encoding standard that assigns numerical values to letters, digits, punctuation marks, and other characters. Here’s an explanation of ASCII and Python functions that work with it.
ASCII Basics
- ASCII uses 7 bits to represent characters (values 0-127)
- Standard ASCII (0-127) includes:
  - Control characters (0-31 and 127)
  - Printable characters (32-126): letters, numbers, punctuation
- Extended ASCII (128-255) is not part of the ASCII standard; the characters assigned to these values vary by system/locale (code page)
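The ranges above can be checked directly in code. The following is a minimal sketch; `classify_code` is a hypothetical helper name introduced here for illustration, not a built-in.

```python
def classify_code(value: int) -> str:
    """Classify an integer according to the ASCII ranges listed above."""
    if not 0 <= value <= 255:
        return "outside the 8-bit range"
    if value <= 31 or value == 127:
        return "control"
    if value <= 126:
        return "printable"
    return "extended (not ASCII)"


# Newline, 'A', DEL, and a value from the extended range
for value in (10, 65, 127, 200):
    print(value, classify_code(value))
```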
Python Functions for ASCII
1. ord() – Get ASCII value of a character
python
char = 'A'
ascii_value = ord(char)
print(ascii_value)  # Output: 65
2. chr() – Get character from ASCII value
python
ascii_value = 97
character = chr(ascii_value)
print(character)  # Output: a
3. String methods that work with ASCII concepts
isascii() – Check if all characters are ASCII (Python 3.7+)
python
text = "Hello"
print(text.isascii())  # Output: True
encode() – Convert string to bytes using ASCII encoding
python
text = "ABC"
bytes_data = text.encode('ascii') # b'ABC'
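Encoding to ASCII fails when the string contains a character above 127. The sketch below shows the exception and three of the standard `errors=` handlers; the sample string is just an illustration.

```python
text = "caf\u00e9"  # 'café', with a non-ASCII 'é' (U+00E9)

try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print(f"Cannot encode as ASCII: {exc.reason}")

# The errors= argument controls what happens to unencodable characters
replaced = text.encode("ascii", errors="replace")           # 'é' becomes '?'
ignored = text.encode("ascii", errors="ignore")             # 'é' is dropped
escaped = text.encode("ascii", errors="backslashreplace")   # 'é' becomes the text \xe9
print(replaced, ignored, escaped)
```

Which handler is appropriate depends on whether losing or altering the character is acceptable for the data at hand.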
4. ASCII-related checks
python
# Check if character is uppercase letter
def is_uppercase(c):
    return ord('A') <= ord(c) <= ord('Z')

# Check if character is lowercase letter
def is_lowercase(c):
    return ord('a') <= ord(c) <= ord('z')

# Check if character is digit
def is_digit(c):
    return ord('0') <= ord(c) <= ord('9')
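These range checks are deliberately ASCII-only, which is where they differ from the built-in string methods. A small comparison, assuming the same range logic restated under the hypothetical name `is_ascii_uppercase` so the block runs on its own:

```python
def is_ascii_uppercase(c: str) -> bool:
    """Range check restricted to the ASCII letters 'A'..'Z'."""
    return ord('A') <= ord(c) <= ord('Z')


# 'A', 'z', 'É' (U+00C9), and ARABIC-INDIC DIGIT THREE (U+0663)
samples = ['A', 'z', '\u00c9', '\u0663']
for c in samples:
    # str.isupper() and str.isdigit() are Unicode-aware
    print(repr(c), is_ascii_uppercase(c), c.isupper(), c.isdigit())

# Combining isascii() with a built-in gives the same ASCII-only behavior
print('\u00c9'.isascii() and '\u00c9'.isupper())
```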
Practical Examples
Convert string to ASCII values
python
text = "Hello"
ascii_values = [ord(c) for c in text]
print(ascii_values)  # Output: [72, 101, 108, 108, 111]
Create string from ASCII values
python
values = [72, 101, 108, 108, 111]
text = ''.join(chr(v) for v in values)
print(text)  # Output: Hello
ASCII Art Example
python
# Simple ASCII art
print("""
  /\\
 /  \\
/____\\
""")
Important Notes
- Python 3 uses Unicode by default, which extends beyond ASCII
- Not all Unicode characters have ASCII equivalents
- When working with files or network protocols, you may need to explicitly use ASCII encoding
Remember that while ASCII is fundamental, modern Python applications typically work with Unicode (UTF-8) to support international characters.
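As an illustration of the note about files, here is a minimal sketch that writes and reads a temporary file with an explicit encoding. The file name and contents are placeholders chosen for this example.

```python
import tempfile
from pathlib import Path

ascii_failed = False

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"  # hypothetical file name

    # Pure ASCII text round-trips with encoding="ascii"
    path.write_text("plain ASCII text\n", encoding="ascii")
    ascii_content = path.read_text(encoding="ascii")

    # Non-ASCII text cannot be written with encoding="ascii"
    try:
        path.write_text("caf\u00e9\n", encoding="ascii")
    except UnicodeEncodeError:
        ascii_failed = True

    # UTF-8 handles the same text without trouble
    path.write_text("caf\u00e9\n", encoding="utf-8")
    utf8_content = path.read_text(encoding="utf-8")

print(ascii_content, utf8_content, ascii_failed)
```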
Unicode, Code Points, Planes, and Digraphs in Python
Unicode Overview
Unicode is a universal character encoding standard that aims to represent every character from every writing system in the world. Unlike ASCII (which only covers 128 characters), Unicode defines over 149,000 characters as of version 15.
Key Concepts
1. Code Points
- A code point is a numerical value that represents a specific character in Unicode
- Represented as U+ followed by hexadecimal digits (e.g., U+0041 for ‘A’)
- In Python, code points can be accessed using ord():
python
print(ord('A')) # 65 (U+0041)
print(ord('€')) # 8364 (U+20AC)
print(ord('🐍')) # 128013 (U+1F40D) - Python snake emoji
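Converting between a character and its U+ notation is a small formatting exercise. The helpers below are a sketch; `to_label` and `from_label` are hypothetical names, not part of the standard library.

```python
def to_label(char: str) -> str:
    """Format a single character as U+XXXX (at least four hex digits)."""
    return f"U+{ord(char):04X}"


def from_label(label: str) -> str:
    """Parse a U+XXXX label back into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a code point label: {label!r}")
    return chr(int(label[2:], 16))


print(to_label('A'))           # Latin capital letter A
print(to_label('\u20ac'))      # Euro sign
print(to_label('\U0001F40D'))  # Snake emoji
print(from_label('U+0041'))
```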
2. Planes
Unicode divides its code space into 17 planes (groups of 65,536 code points each):
- Plane 0 (BMP – Basic Multilingual Plane): U+0000 to U+FFFF
  - Contains most commonly used characters
- Plane 1 (SMP – Supplementary Multilingual Plane): U+10000 to U+1FFFF
  - Historic scripts, musical symbols, emoji
- Plane 2 (SIP – Supplementary Ideographic Plane): U+20000 to U+2FFFF
  - Rare CJK characters
- Plane 3 (TIP – Tertiary Ideographic Plane): U+30000 to U+3FFFF
  - Additional rare and historic CJK characters
- Planes 4-13: Unassigned
- Plane 14 (SSP – Supplementary Special-purpose Plane): U+E0000 to U+EFFFF
  - Special-purpose characters
- Planes 15-16 (PUA – Private Use Areas): U+F0000 to U+10FFFF
  - For private/custom character definitions
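Because each plane holds 65,536 (2^16) code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with `plane_of` as a hypothetical helper name:

```python
def plane_of(char: str) -> int:
    """Return the Unicode plane (0-16) that a single character belongs to."""
    return ord(char) >> 16


print(plane_of('A'))            # Basic Multilingual Plane
print(plane_of('\U0001F40D'))   # snake emoji, Supplementary Multilingual Plane
print(plane_of(chr(0x20000)))   # first code point of the SIP
print(plane_of(chr(0x10FFFF)))  # last valid code point
```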
3. Digraphs and Combining Characters
- Digraph: A pair of characters representing one sound (like ‘ch’ in Spanish)
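- Combining characters: Special code points that modify the preceding character:
  - Example: 'n' followed by the combining tilde (U+0303) renders as 'ñ'. The sequence U+006E U+0303 is canonically equivalent to the precomposed U+00F1, but the two strings compare unequal until they are normalized
  - Example in Python: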
python
# Combining character example
n = '\u006E'      # 'n'
tilde = '\u0303'  # Combining tilde
print(n + tilde)       # Output: ñ
print(len(n + tilde))  # Length is 2 (but appears as one character)
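One practical consequence is that visually identical strings can compare unequal. The sketch below contrasts the combining sequence with the precomposed character, using `unicodedata.normalize` to make them comparable.

```python
import unicodedata

decomposed = 'n\u0303'   # 'n' + combining tilde (two code points)
precomposed = '\u00f1'   # single code point for the same visible character

print(decomposed == precomposed)            # compared code point by code point
print(len(decomposed), len(precomposed))

# Normalizing both sides to the same form makes the comparison reliable
same_after_nfc = (
    unicodedata.normalize('NFC', decomposed)
    == unicodedata.normalize('NFC', precomposed)
)
print(same_after_nfc)
```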
Python Unicode Functions
1. chr() – Create character from code point
python
print(chr(65))      # 'A'
print(chr(128013))  # '🐍'
2. str.encode() – Convert to bytes
python
text = "Python 🐍"
utf8_bytes = text.encode('utf-8')
print(utf8_bytes) # b'Python \xf0\x9f\x90\x8d'
3. bytes.decode() – Convert bytes to string
python
bytes_data = b'Python \xf0\x9f\x90\x8d'
text = bytes_data.decode('utf-8')
print(text) # 'Python 🐍'
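Decoding with the wrong codec is a common source of errors. The following sketch assumes some bytes that were produced as Latin-1 and shows what happens when they are decoded as UTF-8 instead.

```python
latin1_bytes = 'caf\u00e9'.encode('latin-1')  # b'caf\xe9'

decode_failed = False
try:
    latin1_bytes.decode('utf-8')
except UnicodeDecodeError:
    # 0xE9 starts a multi-byte sequence in UTF-8, but nothing follows it
    decode_failed = True

# errors="replace" substitutes U+FFFD (the replacement character)
lossy = latin1_bytes.decode('utf-8', errors='replace')

# Decoding with the codec that was actually used recovers the text
correct = latin1_bytes.decode('latin-1')
print(decode_failed, repr(lossy), correct)
```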
4. Unicode Escape Sequences
python
# Using \u for BMP (4 hex digits)
print('\u0041') # 'A'
# Using \U for non-BMP (8 hex digits)
print('\U0001F40D') # '🐍'
# Using \N with name
print('\N{SNAKE}') # '🐍'
5. Normalization (unicodedata module)
python
from unicodedata import normalize, name
# Normalization forms: NFC, NFD, NFKC, NFKD
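text = 'n\u0303'  # 'n' followed by a combining tilde (2 code points)
print(normalize('NFC', text)) # Composes into 'ñ' (U+00F1), 1 code point
print(normalize('NFD', '\u00f1')) # Decomposes to 'n' + combining tilde (U+006E U+0303)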
# Get character name
print(name('A')) # 'LATIN CAPITAL LETTER A'
print(name('🐍')) # 'SNAKE'
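The compatibility forms (NFKC, NFKD) go further than NFC and NFD: they also replace characters such as ligatures and superscripts with their plain equivalents. A short sketch with two sample characters chosen for illustration:

```python
from unicodedata import normalize

ligature = '\ufb01'     # LATIN SMALL LIGATURE FI
superscript = '\u00b2'  # SUPERSCRIPT TWO

# Canonical normalization leaves compatibility characters untouched
nfc_ligature = normalize('NFC', ligature)

# Compatibility normalization maps them to ordinary letters and digits
nfkc_ligature = normalize('NFKC', ligature)
nfkc_superscript = normalize('NFKC', superscript)

print(len(nfc_ligature), nfkc_ligature, nfkc_superscript)
```

Because NFKC discards distinctions such as superscript versus regular digit, it tends to suit searching and matching better than storing text that must be preserved exactly.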
6. String Methods
python
text = "Python3️⃣🐍"
# Check if all characters are alphanumeric
print(text.isalnum()) # False (the emoji and the keycap's combining marks are not alphanumeric)
# Case folding (more aggressive than lower())
print('ß'.casefold()) # 'ss'
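The main use of `casefold()` is caseless comparison. The sketch below compares two spellings of the same German word, where `lower()` is not enough.

```python
a = 'Stra\u00dfe'  # 'Straße'
b = 'STRASSE'

# lower() keeps 'ß', so the strings still differ
equal_with_lower = a.lower() == b.lower()

# casefold() expands 'ß' to 'ss', so the strings match
equal_with_casefold = a.casefold() == b.casefold()

print(equal_with_lower, equal_with_casefold)
```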
Working with Surrogate Pairs
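Characters outside the BMP (above U+FFFF) are represented using surrogate pairs in UTF-16. Python 3.3+ strings store whole code points, so the pair only becomes visible after encoding to UTF-16:
python
# Emoji is outside BMP (U+1F40D)
snake = '🐍'
# Length in code points
print(len(snake))  # 1
# Iterating yields one full code point, not a surrogate pair
print([hex(ord(c)) for c in snake])  # ['0x1f40d']
# Actual UTF-16 representation (big-endian, no BOM)
print(snake.encode('utf-16-be').hex())  # d83ddc0d (surrogate pair D83D DC0D)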
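The surrogate pair can also be computed by hand, following the UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A sketch, with `to_surrogates` as a hypothetical helper name:

```python
def to_surrogates(char: str) -> tuple[int, int]:
    """Return the (high, low) UTF-16 surrogates for a non-BMP character."""
    code_point = ord(char)
    if code_point < 0x10000:
        raise ValueError("BMP characters are not encoded as surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogates('\U0001F40D')
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder
encoded = '\U0001F40D'.encode('utf-16-be')
print(encoded.hex())
```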
Practical Examples
1. Iterating over Unicode characters
python
for char in "Hello🐍":
    print(f"{char}: U+{ord(char):04X}")
2. Creating custom characters
python
# Using private use area
custom_char = chr(0xE000)
print(custom_char)  # (private use character)
3. Checking character properties
python
import unicodedata

def char_info(c):
    print(f"Character: {c}")
    print(f"Code point: U+{ord(c):04X}")
    print(f"Name: {unicodedata.name(c)}")
    print(f"Category: {unicodedata.category(c)}")

char_info('A')
char_info('🐍')
Important Notes
- Python 3 strings are Unicode by default
- Since Python 3.3, sys.maxunicode is always 1114111 (0x10FFFF); only older versions distinguished “narrow” (UCS-2) and “wide” (UCS-4) builds through this value
- When working with files, always specify encoding (preferably UTF-8)
- Some emoji are actually sequences of multiple code points (emoji + modifiers)
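To illustrate the last note, the sketch below builds two emoji sequences from escape codes and counts their code points. The specific emoji are just examples; `len()` counts code points, not the single glyph a reader sees.

```python
# Thumbs up (U+1F44D) followed by a skin tone modifier (U+1F3FD)
thumbs_up = '\U0001F44D\U0001F3FD'

# Man, woman, and girl joined by ZERO WIDTH JOINER (U+200D)
family = '\U0001F468\u200D\U0001F469\u200D\U0001F467'

for label, emoji in (("thumbs up", thumbs_up), ("family", family)):
    code_points = [f"U+{ord(c):04X}" for c in emoji]
    print(label, len(emoji), code_points)
```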