Building a Biomedical Knowledge Graph from Zero to Hero (2): Literature Extraction
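To extract content from a paper, we render its PDF file as page images and then use pytesseract (a Python wrapper for the Tesseract OCR engine) to extract the text from those images.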
import requests
import pdf2image
import pytesseract

# Download the PDF and render each page as an image
pdf = requests.get('https://arxiv.org/pdf/2110.03526.pdf')
doc = pdf2image.convert_from_bytes(pdf.content)

# Get the article text
article = []
for page_number, page_data in enumerate(doc):
    # image_to_string already returns a str, so no encode/decode round trip is needed
    txt = pytesseract.image_to_string(page_data)
    # Only keep pages with index < 6 (enumerate is 0-based); the rest is treated as references
    if page_number < 6:
        article.append(txt)
article_txt = " ".join(article)
Mohammadreza Ahmadi conducted research on tissue engineering techniques focused on regenerating skin, hair follicles, and other structures derived from stem cells. This study aimed to address challenges in treating various skin conditions such as chronic wounds or diabetic ulcers. Additionally, the medical field sought methods for both aesthetic rejuvenation (cosmetic purposes) and reconstructive medicine. Furthermore, reconstructive medicine employed this approach by delivering pluripotent stem cells directly to target tissues.
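The page filter in the loop above depends on `enumerate` being 0-based, so `page_number < 6` keeps the first six pages (indices 0 to 5). The following is a minimal sketch of that behavior only. The helper name `keep_first_pages` and the placeholder strings are hypothetical and stand in for real pytesseract output.

```python
def keep_first_pages(pages, max_pages=6):
    """Join the text of pages whose 0-based index is below max_pages."""
    kept = [txt for page_number, txt in enumerate(pages) if page_number < max_pages]
    return " ".join(kept)


# Placeholder strings instead of OCR output (hypothetical data).
fake_pages = [f"page{i}" for i in range(8)]
joined = keep_first_pages(fake_pages)
print(joined)
```

If the reference section starts on a different page for another paper, `max_pages` would need to be adjusted accordingly.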
Next, we clean the text and split it into sentences.
import nltk
nltk.download('punkt')

def clean_text(text):
    """Remove section titles and figure descriptions from text"""
    clean = "\n".join([row for row in text.split("\n")
                       if len(row.split(" ")) > 3
                       and not row.startswith("(a)")
                       and not row.startswith("Figure")])
    return clean

# Keep only the text after the "INTRODUCTION" heading
text = article_txt.split("INTRODUCTION")[1]
ctext = clean_text(text)
sentences = nltk.tokenize.sent_tokenize(ctext)
Numerous individuals suffering from skin disorders, including chronic wounds, persistent ulcers, and diabetic ulcers, necessitated the reconstruction and regeneration of their skin surfaces. In other words, the medical field was increasingly seeking effective methods to rejuvenate and restore skin for both aesthetic reasons and therapeutic needs. This demand extended beyond the patient population to include healthy individuals as well.
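To see what the `clean_text` filter actually drops, it can help to run it on a few rows by hand. The sketch below repeats the same filtering rule (rows with three or fewer space-separated words, and rows starting with "(a)" or "Figure", are removed) on an invented sample. The sample rows are made up for illustration and are not taken from the paper.

```python
def clean_text(text):
    """Drop rows with three or fewer words and rows that look like figure captions."""
    return "\n".join(
        row for row in text.split("\n")
        if len(row.split(" ")) > 3
        and not row.startswith("(a)")
        and not row.startswith("Figure")
    )


# Invented sample rows (not from the paper): a short heading, two caption-like
# rows, and one ordinary sentence.
sample = "\n".join([
    "2 RELATED WORK",
    "Figure 1: Overview of the proposed pipeline.",
    "(a) Input image and (b) output",
    "This row is an ordinary sentence with enough words.",
])
cleaned = clean_text(sample)
print(cleaned)
```

Only the last row survives. Note that the word-count rule is a heuristic: a short but legitimate sentence of three words or fewer would be removed as well.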
