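A "伪娘" (wei niang) is a man who, through women's clothing, makeup and similar means, is taken by others to be a woman. A fair number of them can usually be seen at anime conventions around the country, and the term is also one of the moe attributes in ACG ("二次元") culture.
A "药娘" (yao niang) is different. Put simply, the term refers to transgender people whose psychological gender is female and whose biological sex is male; they typically rely on hormone medication to alter their endocrine system, so that their physical characteristics gradually come closer to female. This group is very small and distinct. Related discussion only appeared online last year (2016? inferred from the edit date of the referenced article), and it has not yet drawn wide public attention.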
Extracting a JS variable with BeautifulSoup
Page source:
<div class="myplayer">
<div class="m1938">
<script type="text/javascript">var player_data={"flag":"play","encrypt":0,"trysee":0,"points":0,"link":"\/index.php\/vod\/play\/id\/9221\/sid\/1\/nid\/1.html","link_next":"","link_pre":"","url":"https:\/\/lbbf9.com\/20200325\/WX8h2pjI\/index.m3u8","url_next":"","from":"lbm3u8","server":"no","note":""}</script> <script type="text/javascript" src="/static/js/playerconfig.js?t=20200913"></script><script type="text/javascript" src="/static/js/player.js?t=20200913"></script>
<style>.MacPlayer{background: #000000;font-size:14px;color:#F6F6F6;margin:0px;padding:0px;position:relative;overflow:hidden;width:100%;height:100%;min-height:100px;}.MacPlayer table{width:100%;height:100%;}.MacPlayer #playleft{position:inherit;!important;width:100%;height:100%;}</style>
<div class="MacPlayer"><iframe id="buffer" src="" frameborder="0" scrolling="no" width="100%" height="100%" style="position: absolute; z-index: 99998; display: none;"></iframe><iframe id="install" src="" frameborder="0" scrolling="no" width="100%" height="100%" style="position:absolute;z-index:99998;display:none;"></iframe>
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td id="playleft" valign="top" style=""><iframe width="100%" height="100%" src="/static/player/dplayer.html" frameborder="0" allowfullscreen="true" border="0" marginwidth="0" marginheight="0" scrolling="no"></iframe></td>
</tr>
</tbody>
</table>
</div>
<script src="/static/player/lbm3u8.js?v=0.5806522403562584"></script></div>
</div>
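Python code:
import json
import re

import requests
from bs4 import BeautifulSoup

# The page above defines `var player_data={...}`; the original pattern looked for `cms_player`.
PLAYER_DATA_RE = re.compile(r"var\s+player_data\s*=\s*(\{.*?\})", re.DOTALL)


def get_url_source_code(url):
    # Assumed helper (not shown in the original post): fetch the page HTML.
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text


def get_m3u8_link(url):
    # Find the inline script with a regex, then parse the object literal as JSON.
    # The non-greedy match assumes a flat object with no nested braces, as in the page above.
    print('_' * 70)
    print('[A] Resolving playback URL......')
    html_doc = get_url_source_code(url)
    soup = BeautifulSoup(html_doc, "html.parser")
    script = soup.find('script', string=PLAYER_DATA_RE)
    if script is None:
        raise ValueError('player_data script not found')
    json_data = json.loads(PLAYER_DATA_RE.search(script.string).group(1))
    m3u8_link = json_data['url']
    title = soup.title.string if soup.title and soup.title.string else ''
    print('[A] Title: ' + title)
    print('[A] Playback URL: ' + m3u8_link)
    print('_' * 70)
    return m3u8_link, title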
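The same extraction can be checked offline without any network access or third-party packages. The following is a minimal sketch, not the author's method: it swaps BeautifulSoup for a plain regex plus `json.JSONDecoder.raw_decode`, which parses exactly one JSON value starting at a given offset and therefore does not depend on a trailing semicolon or on the object being flat. `SAMPLE_HTML` is a trimmed copy of the inline script with a placeholder URL, and `extract_js_object` is a hypothetical helper name introduced for illustration.

```python
import json
import re

# Trimmed stand-in for the inline <script> shown in the page source.
# The URL is a placeholder, not the one from the real page.
SAMPLE_HTML = r'''<script type="text/javascript">var player_data={"flag":"play","encrypt":0,"link_next":"","url":"https:\/\/example.com\/video\/index.m3u8","from":"lbm3u8"}</script>'''


def extract_js_object(html, var_name):
    """Return the JSON object assigned to `var <var_name> = {...}` in html."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError("variable %r not found" % var_name)
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (a semicolon, </script>, more code).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


player_data = extract_js_object(SAMPLE_HTML, "player_data")
print(player_data["url"])  # the JSON escape \/ is decoded to a plain /
```

This only works when the assigned value is strict JSON, as it is in the page above; a JavaScript literal with single quotes or unquoted keys would need a different parser.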
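Porn Data Analyze: A Brief Look at the AI Face-Swap Category Data (github)
Disclaimer: all data in this post comes from third-party adult ("福利") websites, and this post only analyzes the relevant information contained in that data. I am very fond of these actresses and have no intention of smearing them.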
from pyspark.sql.functions import col
import altair as alt
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
csv = spark.read.option("header",True).csv("hdfs://localhost:9000/data2/porn_data_movie.csv")
csv.printSchema()
root
|-- id: string (nullable = true)
|-- create: string (nullable = true)
|-- update: string (nullable = true)
|-- name: string (nullable = true)
|-- describe: string (nullable = true)
|-- source_id: string (nullable = true)
|-- publish_time: string (nullable = true)
|-- play_count: string (nullable = true)
|-- good_count: string (nullable = true)
|-- bad_count: string (nullable = true)
|-- link_count: string (nullable = true)
|-- comment_count: string (nullable = true)
|-- designation: string (nullable = true)
|-- category_id: string (nullable = true)
|-- porn_site_id: string (nullable = true)
|-- uploader_id: string (nullable = true)
|-- producer: string (nullable = true)
Porn Data Analyze: Uploader and Category Information Analysis (github)
'''
Video uploader and video category information analysis
http://www.h4ck.org.cn
by obaby
obaby@mars
email:root@obaby.org.cn
date: 2020.09.04
'''
from pyspark.sql.functions import col
import altair as alt
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
csv = spark.read.option("header",True).csv("hdfs://localhost:9000/data2/porn_data_movie.csv")
csv.printSchema()
root
|-- id: string (nullable = true)
|-- create: string (nullable = true)
|-- update: string (nullable = true)
|-- name: string (nullable = true)
|-- describe: string (nullable = true)
|-- source_id: string (nullable = true)
|-- publish_time: string (nullable = true)
|-- play_count: string (nullable = true)
|-- good_count: string (nullable = true)
|-- bad_count: string (nullable = true)
|-- link_count: string (nullable = true)
|-- comment_count: string (nullable = true)
|-- designation: string (nullable = true)
|-- category_id: string (nullable = true)
|-- porn_site_id: string (nullable = true)
|-- uploader_id: string (nullable = true)
|-- producer: string (nullable = true)
csv.select('name', 'describe', 'uploader_id').show()
Porn Data Analyze: Tag and Model Information Analysis (github)
from pyspark.sql.functions import col
import altair as alt
import pandas as pd
from matplotlib import pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
csv = spark.read.option("header",True).csv("hdfs://localhost:9000/data2/porn_data_movie_tags.csv")
tag_csv = spark.read.option("header",True).csv("hdfs://localhost:9000/data2/porn_data_tag.csv")
csv.show()
+---+--------+------+
| id|movie_id|tag_id|
+---+--------+------+
|  1|    9909|     1|
|  2|    9909|     2|
|  3|    9909|     3|
|  4|    9909|     4|
|  5|    9910|     5|
|  6|    9910|     6|
|  7|    9910|     7|
|  8|    9910|     8|
|  9|    9910|     9|
| 10|    9910|    10|
| 11|    9911|    12|
| 12|    9911|     2|
| 13|    9911|     1|
| 14|    9911|    13|
| 15|    9910|    11|
| 16|    9911|    14|
| 17|    9911|    15|
| 18|    9911|     5|
| 19|    9910|    16|
| 20|    9910|    17|
+---+--------+------+
only showing top 20 rows
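The preview is a movie-to-tag link table, so an obvious first step is to count links per movie and per tag. In PySpark that would normally be a `groupBy(...).count()` on the DataFrame; the block below is only a plain-Python sketch of the same aggregation over the 20 rows shown, so it can run without a Spark session. `ROWS` and `count_by` are names introduced here for illustration, and the values are written as integers although the CSV is read as strings.

```python
from collections import Counter

# (id, movie_id, tag_id) rows copied from the 20-row preview.
ROWS = [
    (1, 9909, 1), (2, 9909, 2), (3, 9909, 3), (4, 9909, 4),
    (5, 9910, 5), (6, 9910, 6), (7, 9910, 7), (8, 9910, 8),
    (9, 9910, 9), (10, 9910, 10), (11, 9911, 12), (12, 9911, 2),
    (13, 9911, 1), (14, 9911, 13), (15, 9910, 11), (16, 9911, 14),
    (17, 9911, 15), (18, 9911, 5), (19, 9910, 16), (20, 9910, 17),
]


def count_by(rows, index):
    """Count how often each value appears in column `index` of rows."""
    return Counter(row[index] for row in rows)


tags_per_movie = count_by(ROWS, 1)   # movie_id -> number of tags
movies_per_tag = count_by(ROWS, 2)   # tag_id -> number of movies

print(dict(tags_per_movie))
# Tags attached to more than one movie in this preview.
shared_tags = sorted(tag for tag, n in movies_per_tag.items() if n > 1)
print(shared_tags)
```

On these 20 rows, movie 9910 carries the most tags (nine), and only tags 1, 2 and 5 are shared between movies. The counts describe the preview only, not the full dataset.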
Porn Data Analyze: A First Look at the Video Data
'''
--------------------------------------------------------------------------------
Adult-site ("福利") data analysis
Basic data analysis, title word segmentation, word-frequency statistics
-----------------------------------
by: obaby
email: root@obaby.org.cn
blog: http://www.h4ck.org.cn
===================================
Reference: https://sparkbyexamples.com/pyspark/select-columns-from-pyspark-dataframe/
-------------------------------------------------------------------------------
'''
import jieba
# Read the CSV file with Spark; column names are taken from the CSV header
csv = spark.read.option("header",True).csv("hdfs://localhost:9000/data2/porn_data_movie.csv")
# Data schema
csv.printSchema()
root
|-- id: string (nullable = true)
|-- create: string (nullable = true)
|-- update: string (nullable = true)
|-- name: string (nullable = true)
|-- describe: string (nullable = true)
|-- source_id: string (nullable = true)
|-- publish_time: string (nullable = true)
|-- play_count: string (nullable = true)
|-- good_count: string (nullable = true)
|-- bad_count: string (nullable = true)
|-- link_count: string (nullable = true)
|-- comment_count: string (nullable = true)
|-- designation: string (nullable = true)
|-- category_id: string (nullable = true)
|-- porn_site_id: string (nullable = true)
|-- uploader_id: string (nullable = true)
|-- producer: string (nullable = true)
Porn Data Analyze: Installing Spark
In this setup Spark uses Python 2 by default. You can edit .bashrc so that Spark uses Python 3 by default instead. Add the following lines to .bashrc:
# anaconda
export ANACONDA_HOME=/home/dbuser/anaconda3/
export PATH=$ANACONDA_HOME/bin:$PATH
# spark
export PYSPARK_PYTHON=/home/dbuser/anaconda3/bin/python3
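After reloading .bashrc (for example by opening a new shell), it can help to confirm what the environment actually contains. The snippet below is a small sketch using only the standard library. `interpreter_report` is a hypothetical helper introduced here; it reports the interpreter running the script and the value of `PYSPARK_PYTHON`, and it does not query Spark itself.

```python
import os
import sys


def interpreter_report(environ=os.environ):
    """Summarise the running Python and the interpreter PySpark is pointed at."""
    return {
        "running_python": sys.executable,
        "running_major_version": sys.version_info.major,
        # None means PYSPARK_PYTHON was not exported in this environment.
        "pyspark_python": environ.get("PYSPARK_PYTHON"),
    }


report = interpreter_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `pyspark_python` prints `None`, the export lines were not picked up by the current shell.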