String extraction from iterating and mixed lines in Python

I have a dataset like the one below:

"birth_date_1:25        birth_date_2:august     birth_date_3:1945    birth_place_1:france   death_date:   "
"birth_date_1:14        birth_date_2:june       birth_date_3:1995   birth_place_1:dvůr     birth_place_2:králové     birth_place_3:nad       birth_place_4:labem     birth_place_5:,     birth_place_6:czech     birth_place_7:republic  "
"birth_date_1:21        birth_date_2:february       birth_date_3:1869   birth_place_1:blackburn     birth_place_2:,     birth_place_3:england   death_date_1:12     death_date_2:march      death_date_3:1917   "
"birth_date_1:07        birth_date_2:july       birth_date_3:1979   birth_place_1:ghana     birth_place_2:,     birth_place_3:accra "
"birth_date_1:27        birth_date_2:february       birth_date_3:1979   birth_place_1:durban        birth_place_2:,     birth_place_3:south     birth_place_4:africa    "
"birth_date_1:1989  birth_place_1:lima      birth_place_2:,     birth_place_3:peru  "
"birth_date_1:5     birth_date_2:september      birth_date_3:1980   birth_place_1:angola    death_date:   "
"birth_date_1:1     birth_date_2:february       birth_date_3:1856   birth_place_1:hampstead     birth_place_2:,     birth_place_3:london    death_date_1:14     death_date_2:august     death_date_3:1905   "
"birth_date_1:28        birth_date_2:december       birth_date_3:1954   birth_place_1:hickory       birth_place_2:,     birth_place_3:north     birth_place_4:carolina  death_date:   "
"birth_date:  "
"birth_date:  birth_place:  death_date:   "
"birth_date:  birth_place_1:belfast       birth_place_2:,     birth_place_3:northern      birth_place_4:ireland   "
"birth_date:  birth_place:  death_date:   "
"birth_date_1:28        birth_date_2:february       birth_date_3:1891   birth_place_1:carberry      birth_place_2:,     birth_place_3:manitoba  death_date_1:20     death_date_2:september      death_date_3:1968   "
"birth_date_1:4     birth_date_2:november       birth_date_3:1993   birth_place_1:portimão     birth_place_2:,     birth_place_3:portugal  "

From this dataset, I am trying to extract the information in the following tab-separated format (birth date, birth place, death date):

25.08.1945 \t France \t NA
14.06.1995 \t Dvůr Králové nad Labem,Czech Republic \t NA
21.02.1869 \t Blackburn,England \t 12.03.1917
1989 \t Lima,Peru \t NA
NA \t NA \t NA
NA \t NA \t NA
NA \t Belfast,Northern Ireland \t NA
04.11.1993 \t Portimão,Portugal \t NA

I wrote the code below to achieve this. However, my dataset contains several different scenarios: for example, birth_date_1 can be empty, a day, a month name, or a year. Because of that, the loop I came up with feels like it is going to fail somewhere and won't be feasible.

    import numpy as np

    outputfile = open('ornek_box_seperated_update.csv', 'w', encoding="utf-8")
    inputfile = open('ornek_box_seperated.csv', 'r', encoding="utf-8")

    # one row per input line; nine columns, initialised to NaN
    birthDatePlace = [[np.nan for i in range(9)] for j in range(20000)]

    for row, line in enumerate(inputfile):
        # keep only the values: "birth_date_1:25" -> "25"
        d = [token.partition(":")[2] for token in line.strip().strip('"').split()]
        if not d or not d[0].isdigit():
            continue  # birth date is empty or does not start with a number
        first = int(d[0])
        if 0 < first < 40:
            # first value looks like a day of the month
            birthDatePlace[row][1] = first
            if len(d) > 1 and not d[1].isdigit():
                birthDatePlace[row][2] = d[1]  # second value is taken as the month name
        elif first < 2020:
            # first value looks like a year
            birthDatePlace[row][4] = first

    # the plan is to continue from here until all the birth place and death date information is covered in the required format


I would appreciate any help you can provide. I am fairly new to Python, and especially to regex and string extraction methods.

Thank you in advance for your kind support.

asked 27 secs ago
Kaan Karabal
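
One possible way to avoid checking positions one by one is to group the values by field name first and format each group afterwards. The sketch below is only an illustration under a few assumptions: every token has the form `key:value`, the numeric suffix (`_1`, `_2`, ...) only gives the order, and month names are lowercase English. The helper names (`parse_line`, `format_date`, `format_place`, `convert`) are made up for this example and are not part of the original code.

```python
import re

# Assumption: month names in the data are lowercase English
MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

FIELDS = ("birth_date", "birth_place", "death_date")


def parse_line(line):
    """Group the values of one line by field name, ignoring the _N suffix."""
    groups = {field: [] for field in FIELDS}
    for token in line.strip().strip('"').split():
        key, _, value = token.partition(":")
        field = re.sub(r"_\d+$", "", key)  # "birth_date_2" -> "birth_date"
        if field in groups and value:      # skip empty values such as "death_date:"
            groups[field].append(value)
    return groups


def format_date(parts):
    """['25', 'august', '1945'] -> '25.08.1945'; an empty list -> 'NA'."""
    if not parts:
        return "NA"
    formatted = []
    for part in parts:
        if part in MONTHS:
            formatted.append(f"{MONTHS[part]:02d}")  # month name -> two-digit number
        elif part.isdigit() and len(part) < 3:
            formatted.append(part.zfill(2))          # day -> two digits
        else:
            formatted.append(part)                   # year, or anything unexpected
    return ".".join(formatted)


def format_place(parts):
    """['lima', ',', 'peru'] -> 'Lima,Peru'; an empty list -> 'NA'."""
    if not parts:
        return "NA"
    text = " ".join(parts).title()
    return re.sub(r"\s*,\s*", ",", text)  # remove the spaces around commas


def convert(line):
    """Return birth date, birth place and death date joined by tabs."""
    groups = parse_line(line)
    return "\t".join([
        format_date(groups["birth_date"]),
        format_place(groups["birth_place"]),
        format_date(groups["death_date"]),
    ])


sample = '"birth_date_1:1989  birth_place_1:lima      birth_place_2:,     birth_place_3:peru  "'
print(convert(sample))
```

To process the whole file, `convert(line)` could be called for each line of the input file and the result written to the output file followed by a newline. Note that `str.title()` capitalises every word, so a place such as "dvůr králové nad labem" would come out as "Dvůr Králové Nad Labem" rather than with a lowercase "nad"; handling such small words would need an extra rule.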

Posted by استخدام کار on Tuesday, 9 Mordad 1397, 4:34