Unicode

Unicode HOWTO (Python 3.6.0 documentation)
https://docs.python.org/3/howto/unicode.html

with open(filename, encoding='utf-16') as f:
    text = f.read()
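A minimal sketch of the round trip, assuming a temporary file and a made-up sample string (neither comes from the original notes). It writes text as UTF-16, reads it back with the matching encoding, and peeks at the raw bytes to show the byte order mark that the utf-16 codec puts first.

```python
import os
import tempfile

# Hypothetical sample text containing non-ASCII characters.
text = "Franz M\u00fcller, na\u00efve caf\u00e9"

# Write the text to a temporary file encoded as UTF-16.
with tempfile.NamedTemporaryFile("w", encoding="utf-16", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write(text)
    filename = tmp.name

# Reading with the matching encoding gives back the original str.
with open(filename, encoding="utf-16") as f:
    decoded = f.read()

# Reading in binary mode exposes the byte order mark at the start.
with open(filename, "rb") as f:
    raw = f.read()

os.remove(filename)

print(decoded == text)
print(raw[:2] in (b"\xff\xfe", b"\xfe\xff"))
```

Opening the same file without the encoding argument would use the platform default and would most likely fail or return garbage.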

>>> from nltk.corpus import names

>>> labeled_names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

>>> len(labeled_names)
7944

>>> for i in range(3):
...     print(labeled_names[i])
...
(u'Aamir', 'male')
(u'Aaron', 'male')
(u'Abbey', 'male')

>>> labeled_names[881][0]
u'Franz'
>>> type(labeled_names[881][0])
<type 'unicode'>
>>> type(labeled_names[881][0].decode("utf-8"))
<type 'unicode'>
>>> labeled_names[881][0].encode("ascii")
'Franz'

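ch.isdigit() returns True if ch has the Unicode property Numeric_Type=Digit or Numeric_Type=Decimal. That covers every character in category Nd and some in category No (for example superscript digits), but not all of No: fractions such as '½' are No and return False.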
http://stackoverflow.com/questions/9480419/best-way-to-check-the-type-of-a-variable/27797640#27797640
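A small sketch comparing isdecimal(), isdigit() and isnumeric() on a few characters, next to their Unicode general category. The sample characters are my own choice for illustration, not from the original notes.

```python
import unicodedata

# ASCII digit, Arabic-Indic digit three, superscript two, one half, a letter.
samples = ["7", "\u0663", "\u00b2", "\u00bd", "x"]

results = {}
for ch in samples:
    category = unicodedata.category(ch)  # e.g. 'Nd', 'No', 'Ll'
    results[ch] = (category, ch.isdecimal(), ch.isdigit(), ch.isnumeric())
    print(repr(ch), *results[ch])
```

The superscript two and the fraction one half share category No, yet only the first one counts as a digit.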

Python 2 had a unicode() function; in Python 3 every str is Unicode, so str() takes its place.
http://www.diveintopython3.net/porting-code-to-python-3-with-2to3.html

Python 2.7 > Unicode strings
https://docs.python.org/2/tutorial/introduction.html#unicode-strings

Parsing XML from a webpage


import urllib.request
import xml.etree.ElementTree as ET

url = 'http://www.oxfordlearnersdictionaries.com/us/definition/english/felicity'

f = urllib.request.urlopen(url)
data = f.read().decode("utf-8")

print(len(data))

# The page is HTML rather than well-formed XML, which is the likely cause of the error.
root = ET.fromstring(data)
# -> xml.etree.ElementTree.ParseError
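The failure can be reproduced offline. This is a sketch with two made-up strings (not the real dictionary page): one well-formed XML fragment and one HTML-style fragment with an unclosed tag, which ElementTree rejects.

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs for illustration only.
well_formed = "<entry><word>felicity</word><pos>noun</pos></entry>"
html_like = "<html><body><p>felicity<br></p></body></html>"  # <br> is never closed

# Well-formed XML parses into an element tree.
root = ET.fromstring(well_formed)
print(root.find("word").text)

# HTML that is not well-formed XML raises ParseError.
try:
    ET.fromstring(html_like)
    parsed = True
except ET.ParseError as err:
    parsed = False
    print("ParseError:", err)
```

For real web pages a forgiving HTML parser is usually the better fit than an XML parser.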


>>> from bs4 import BeautifulSoup
>>>
>>> html_tag = BeautifulSoup(data, 'html.parser')('html')[0]

XML instance:
https://d18ky98rnyall9.cloudfront.net/aFJF93QMEeWtlRLKY8QGgw.processed/full/360p/index.mp4

Socket Programming

Socket Programming HOWTO
https://docs.python.org/3.5/howto/sockets.html

18.1. socket — Low-level networking interface
https://docs.python.org/3/library/socket.html

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('www.py4inf.com', 80))
# HTTP expects CRLF line endings; the empty line ends the request headers.
s.send(b'GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\r\n\r\n')
while True:
    data = s.recv(512)    # or 1024
    if len(data) < 1:
        break
    print(data.decode('UTF-8'))
s.close()

==================================

21.6. urllib.request — Extensible library for opening URLs
https://docs.python.org/3.5/library/urllib.request.html

import urllib.request

url = 'http://www.py4inf.com/code/romeo.txt'
s = urllib.request.urlopen(url)
for line in s:
    print(line.decode('UTF-8').strip())

————————————————————-

import urllib.request

url = 'http://www.py4inf.com/code/romeo.txt'

with urllib.request.urlopen(url) as f:
    print(f.read().decode('utf-8'))

from:
https://docs.python.org/3/library/urllib.request.html#examples

The with statement
https://docs.python.org/3.4/reference/compound_stmts.html#the-with-statement

—————————————————————

import urllib.request
url = 'http://www.py4inf.com/code/romeo.txt'
local_filename, headers = urllib.request.urlretrieve(url)
with open(local_filename) as f:
    print(f.read())

Python: How to get the Content-Type of an URL?
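One possible way to do this with urllib.request, sketched under a few assumptions: the helper name get_content_type is my own, and the demo uses a data: URL so that it runs without a network connection. An http:// URL such as the romeo.txt one above should work the same way, since the response object exposes its headers in both cases.

```python
import urllib.request


def get_content_type(url):
    """Return (media type, charset) taken from the Content-Type header.

    Hypothetical helper for illustration; charset is None when the
    server does not declare one.
    """
    with urllib.request.urlopen(url) as response:
        headers = response.headers  # parsed response headers
        return headers.get_content_type(), headers.get_content_charset()


# A data: URL carries its own media type, which keeps the demo offline.
print(get_content_type("data:text/plain;charset=utf-8,hello"))
```

The charset returned here is what the body should be decoded with, instead of hard-coding 'utf-8'.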

from:
Using Python to Access Web Data
Coursera, Oct 26 — Dec 14, 2015.
https://www.coursera.org/learn/python-network-data/lecture/UxIOc/lets-write-a-web-browser

for bytes() or decode():
http://stackoverflow.com/questions/5471158/typeerror-str-does-not-support-the-buffer-interface
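A short sketch of the str/bytes distinction behind that error, using a made-up sample string. Sockets and urlopen() deal in bytes, so text has to be encoded before sending and decoded after receiving.

```python
# Hypothetical sample text with one non-ASCII character.
text = "Franz M\u00fcller"

# str -> bytes: encode() and bytes(text, encoding) give the same result.
encoded = text.encode("utf-8")
same = bytes(text, "utf-8") == encoded
print(type(encoded).__name__, len(text), len(encoded))

# bytes -> str: decode() with the same encoding restores the text.
decoded = encoded.decode("utf-8")
print(decoded == text)

# Mixing the two types is an error in Python 3.
try:
    b"GET " + text
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```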

for more info on the “with” statement:
help('with')

Python Source Code Encoding

2.2. The Interpreter and Its Environment
2.2.1. Source Code Encoding
https://docs.python.org/3/tutorial/interpreter.html
By default, Python source files are treated as encoded in UTF-8.
In that encoding, characters of most languages in the world can be used simultaneously in string literals, identifiers and comments — although the standard library only uses ASCII characters for identifiers, a convention that any portable code should follow.

PEP 0263 — Defining Python Source Code Encodings
http://legacy.python.org/dev/peps/pep-0263

This PEP applies to Python 2: there, Python defaults to ASCII as the source encoding if no other encoding hints are given. Python 3 defaults to UTF-8 instead (PEP 3120).


The default encoding was set to “ascii” in version 2.5.
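To see which encoding Python 3 would pick for a source file, the standard library's tokenize.detect_encoding() can be called directly. This is a sketch: the helper name source_encoding and the three byte strings are illustrative assumptions, standing in for real files on disk.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode this source.

    Hypothetical helper wrapping tokenize.detect_encoding().
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Illustrative source files, given as raw bytes.
no_hint = b"print('hello')\n"
with_cookie = b"# -*- coding: latin-1 -*-\nprint('hello')\n"
with_bom = b"\xef\xbb\xbfprint('hello')\n"

print(source_encoding(no_hint))      # no declaration: the Python 3 default
print(source_encoding(with_cookie))  # PEP 263 coding comment
print(source_encoding(with_bom))     # UTF-8 byte order mark
```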

ANSI format
http://stackoverflow.com/questions/701882/what-is-ansi-format


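A leading '\ufeff' is a byte order mark (BOM). Decoding the raw bytes with 'utf-8-sig' strips it: b'\xef\xbb\xbfChapter'.decode('utf-8-sig') returns 'Chapter'.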
http://stackoverflow.com/questions/17912307/u-ufeff-in-python-string
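A minimal sketch of the difference between the two codecs, built from in-memory bytes rather than a real file. The sample word is the one from the note above; everything else is an illustrative assumption.

```python
import codecs

# Bytes as they would appear at the start of a UTF-8 file saved with a BOM.
raw = codecs.BOM_UTF8 + "Chapter".encode("utf-8")

plain = raw.decode("utf-8")      # the BOM survives as a '\ufeff' character
clean = raw.decode("utf-8-sig")  # the BOM is stripped

print(repr(plain))
print(repr(clean))

# Input without a BOM decodes normally with utf-8-sig as well.
no_bom = b"Chapter".decode("utf-8-sig")
print(repr(no_bom))
```

The same codec name can be passed to open(), for example open(filename, encoding='utf-8-sig'), when the data comes from a file.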