Information Extraction


Overview

  • information extraction
    • Given this: “Brasilia, the Brazilian capital, was founded in 1960.”
    • Obtain this:
      • capital(Brazil, Brasilia)
      • founded(Brasilia, 1960)
    • Main goal: turn text into structured data (a minimal sketch of this target representation follows this list)
  • applications
    • Stock analysis
      • Gather information from news and social media
      • Summarise texts into a structured format
      • Decide whether to buy/sell at current stock price
    • Medical research
      • Obtain information from articles about diseases and treatments
      • Decide which treatment to apply to a new patient
  • how
    • Two steps:
      • Named Entity Recognition (NER): find out entities such as “Brasilia” and “1960”
      • Relation Extraction: use context to find the relation between “Brasilia” and “1960” (“founded”)
  • machine learning in IE
    • Named Entity Recognition (NER): sequence models such as RNNs, HMMs or CRFs.
    • Relation Extraction: mostly classifiers, either binary or multi-class.
    • This lecture: how to frame these two tasks in order to apply sequence labellers and classifiers.
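
As a concrete illustration of the "structured data" goal above, here is a minimal sketch of the target representation. The `Relation` tuple and its field names are my own illustrative choices, not notation from the lecture.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """One extracted fact: name(arg1, arg2)."""
    name: str
    arg1: str
    arg2: str


# Structured facts extracted from:
# "Brasilia, the Brazilian capital, was founded in 1960."
facts = [
    Relation("capital", "Brazil", "Brasilia"),
    Relation("founded", "Brasilia", "1960"),
]

for r in facts:
    print(f"{r.name}({r.arg1}, {r.arg2})")
# capital(Brazil, Brasilia)
# founded(Brasilia, 1960)
```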

Named Entity Recognition


  • typical entity tags (the types of tags to use depend on the domain)
    • PER (person): people, characters
    • ORG (organisation): companies, sports teams
    • LOC (natural location): regions, mountains, seas
    • GPE (geo-political entity): countries, states, provinces (in some tag sets this is labelled LOC)
    • FAC (facility): bridges, buildings, airports
    • VEH (vehicle): planes, trains, cars
    • The tag set is application-dependent: some domains deal with specific entities, e.g. proteins and genes
  • NER as sequence labelling
    • NE tags can be ambiguous:

      • “Washington” can be a person, location or political entity
    • Similar problem when doing POS tagging

      • possible solution: incorporate context
    • Can we use a sequence tagger for this (e.g. HMM)?

      • Not directly, as entities can span multiple tokens (multiple words)
      • Solution: modify the tag set
    • IO (inside/outside) tagging

      • [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
      • ‘I-ORG’ represents a token that is inside an entity (ORG in this case).
      • All tokens which are not part of an entity get the ‘O’ tag (for outside).
      • Cannot differentiate between:
        • a single entity with multiple tokens
        • multiple entities with single tokens
    • IOB (inside/outside/beginning) tagging

      • [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.

      • B-ORG represents the beginning of an ORG entity.

      • If the entity has more than one token, subsequent tokens are tagged I-ORG.

      • example: annotate the following sentence with NER tags (IOB)

        • Steve Jobs founded Apple Inc. in 1976. Tag set: PER, ORG, LOC, TIME
          • [B-PER Steve] [I-PER Jobs] [O founded] [B-ORG Apple] [I-ORG Inc.] [O in] [B-TIME 1976]
    • NER as sequence labelling

      • Given such a tagging scheme, we can train any sequence labelling model (a small IOB conversion sketch follows this list)
      • In theory, HMMs can be used, but discriminative models such as CRFs are preferred (HMMs cannot easily incorporate arbitrary features)
    • NER

      • features

        • Example: L’Occitane
        • Prefix/suffix:
          • L / L’ / L’O / L’Oc / …
          • e / ne / ane / tane / …
        • Word shape:
          • X’Xxxxxxxx / X’Xx
          • XXXX-XX-XX (date!)
        • POS tags / syntactic chunks: many entities are nouns or noun phrases.
        • Presence in a gazetteer: lists of entities, such as place names, people’s names and surnames, etc.
      • classifier

      • deep learning for NER

        • The state-of-the-art approach uses LSTMs with character and word embeddings (Lample et al., 2016)
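
As referenced in the tagging example above, here is a minimal sketch of framing NER as sequence labelling: converting token-level entity spans into IOB tags, plus a simple word-shape feature. The span format and the helper names `to_iob` and `word_shape` are assumptions for illustration, not part of the lecture or of any library.

```python
def to_iob(tokens, entities):
    """Convert token-level entity spans into IOB tags.

    `entities` holds (start, end, label) spans with `end` exclusive,
    e.g. (0, 2, "PER") covers tokens[0:2] (an assumed span format).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def word_shape(token):
    """Coarse word shape: uppercase -> X, lowercase -> x, digit -> d, rest kept."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


tokens = ["Steve", "Jobs", "founded", "Apple", "Inc.", "in", "1976"]
entities = [(0, 2, "PER"), (3, 5, "ORG"), (6, 7, "TIME")]
print(list(zip(tokens, to_iob(tokens, entities))))
# [('Steve', 'B-PER'), ('Jobs', 'I-PER'), ('founded', 'O'),
#  ('Apple', 'B-ORG'), ('Inc.', 'I-ORG'), ('in', 'O'), ('1976', 'B-TIME')]
print(word_shape("L’Occitane"))  # X’Xxxxxxxx
```

The resulting per-token tags, together with features such as word shape, are exactly what a CRF or LSTM tagger is trained on.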

Relation Extraction

  • relation extraction

    • [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
    • Traditionally framed as triple extraction (a relation and its two entity arguments):
      • unit(American Airlines, AMR Corp.)
      • spokesman(Tim Wagner, American Airlines)
    • Key question: do we know all the possible relations?
      • map relations to a closed set of relations
      • unit(American Airlines, AMR Corp.) → subsidiary
      • spokesman(Tim Wagner, American Airlines) → employment
  • methods

    • If we have access to a fixed relation database:

      • Rule-based
      • Supervised
      • Semi-supervised
      • Distant supervision
    • If no restrictions on relations:

      • Unsupervised
      • Sometimes referred to as “OpenIE”
    • rule-based relation extraction

      • “Agar is a substance prepared from a mixture of red algae such as Gelidium, for laboratory or industrial use.”
      • identify linguistic patterns in the sentence
      • [NP red algae] such as [NP Gelidium]
      • NP0 such as NP1 → hyponym(NP1, NP0)
      • hyponym(Gelidium, red algae)
      • Lexico-syntactic patterns: high precision, low recall (unlikely to recover all patterns: there are too many linguistic patterns!), manual effort required (a regex sketch of the ‘such as’ pattern follows this list)
      • more rules
    • supervised relation extraction

      • Assume a corpus with annotated relations
      • Two steps (a single step would face a severe class-imbalance problem: most entity pairs have no relation!)
        • First, find if an entity pair is related or not (binary classification)
          • For each sentence, gather all possible entity pairs
          • Annotated pairs are considered positive examples
          • Non-annotated pairs are taken as negative examples
        • Second, for pairs predicted as positive, use a multiclass classifier (e.g. SVM) to obtain the relation
        • example
          • [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
          • First:
            • (American Airlines, AMR Corp.) → positive
            • (American Airlines, Tim Wagner) → positive
            • (AMR Corp., Tim Wagner) → negative
          • Second:
            • (American Airlines, AMR Corp.) → subsidiary
            • (American Airlines, Tim Wagner) → employment
      • features
        • [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
        • (American Airlines, Tim Wagner) → employment
    • semi-supervised relation extraction

      • Annotated corpora are very expensive to create

      • Use seed tuples to bootstrap a classifier (the seeds are used to find more training data)

      • steps:

        1. Given seed tuple: hub(Ryanair, Charleroi)
        2. Find sentences containing terms in seed tuples
          • Budget airline Ryanair, which uses Charleroi as a hub, scrapped all weekend flights out of the airport
        3. Extract general patterns
          • [ORG], which uses [LOC] as a hub
        4. Find new tuples with these patterns
          • hub(Jetstar, Avalon)
        5. Add the new tuples to the existing set and repeat from step 2
      • issues

        • Extracted tuples deviate from original relation over time
          • semantic drift (tuples gradually deviate from the original relation)
            • Pattern: [NP] has a {NP}* hub at [LOC]
            • Sydney has a ferry hub at Circular Quay
              • hub(Sydney, Circular Quay)
            • More erroneous patterns are then extracted from this tuple…
            • Should only accept patterns with high confidence
        • Difficult to evaluate (no labels for the newly extracted tuples)
        • Extracted general patterns tend to be very noisy
    • distant supervision

      • Semi-supervised methods assume the existence of seed tuples to mine new tuples

      • Can we mine new tuples directly?

      • Distant supervision obtains new tuples from a range of existing knowledge bases:

        • DBpedia
        • Freebase
      • This generates massive training sets, enabling the use of richer features, with no risk of semantic drift

    • unsupervised relation extraction

      • No fixed or closed set of relations
      • Relations are sub-sentences, usually containing a verb
      • “United has a hub in Chicago, which is the headquarters of United Continental Holdings.”
        • “has a hub in”(United, Chicago)
        • “is the headquarters of”(Chicago, United Continental Holdings)
      • Main problem: there are very many relation forms, so relations need to be mapped into canonical forms
    • evaluation

      • NER: F1-measure at the entity level.
      • Relation Extraction with known relation set: F1-measure
      • Relation Extraction with unknown relations: much harder to evaluate
        • Usually need some human evaluation
        • The massive datasets used in these settings are impractical to evaluate manually in full (evaluate on samples)
        • Can only obtain (approximate) precision, not recall (too many possible relations!)
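
As referenced in the rule-based bullet above, here is a minimal regex sketch of the "NP0 such as NP1" pattern. Noun phrases are approximated very crudely (at most two words before "such as", one capitalised word after it); a real system would match over NP chunks, so treat this only as an illustration of the lexico-syntactic idea.

```python
import re

# Hearst-style lexico-syntactic pattern: "NP0 such as NP1" -> hyponym(NP1, NP0).
# Crude NP approximation: up to two words before "such as", one capitalised
# word after it. This is an assumption for the sketch, not the lecture's rule set.
PATTERN = re.compile(r"(\w+(?: \w+)?) such as ([A-Z]\w*)")


def extract_hyponyms(sentence):
    """Return (relation, hyponym, hypernym) triples matched by the pattern."""
    triples = []
    for m in PATTERN.finditer(sentence):
        hypernym, hyponym = m.group(1), m.group(2)
        triples.append(("hyponym", hyponym, hypernym))
    return triples


sent = ("Agar is a substance prepared from a mixture of red algae "
        "such as Gelidium, for laboratory or industrial use.")
for rel, hypo, hyper in extract_hyponyms(sent):
    print(f"{rel}({hypo}, {hyper})")
# hyponym(Gelidium, red algae)
```

The high precision / low recall trade-off mentioned above is visible here: the pattern rarely fires wrongly, but it misses every relation expressed in any other way.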

Other IE Tasks

  • temporal expression extraction

    “[TIME July 2, 2007]: A fare increase initiated [TIME last week] by UAL Corp’s United Airlines was matched by competitors over [TIME the weekend], marking the second successful fare increase in [TIME two weeks].”

    • Anchoring: when is “last week”?
      • “last week” → 2007-W26
    • Normalisation: mapping expressions to canonical forms.
      • July 2, 2007 → 2007-07-02
    • Mostly rule-based approaches (a small normalisation sketch follows this list)
  • event extraction

    • “American Airlines, a unit of AMR Corp., immediately [EVENT matched] [EVENT the move], spokesman Tim Wagner [EVENT said].”
    • Very similar to NER in annotation and learning methods, but with different tags.
    • Event ordering: detect how a set of events happened in a timeline.
      • Involves both event extraction and temporal expression extraction.
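
As referenced in the temporal-expression bullet above, here is a minimal rule-based sketch of normalisation and anchoring. It handles only absolute dates of the form "Month D, YYYY" and the single expression "last week" given a known document date; the helper names are illustrative, not from the lecture.

```python
import re
from datetime import date, timedelta

# Month-name lookup for absolute dates such as "July 2, 2007".
MONTHS = {m: i + 1 for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June",
     "July", "August", "September", "October", "November", "December"])}


def normalise_date(expr):
    """Normalise 'Month D, YYYY' to ISO 'YYYY-MM-DD'; return None if no rule applies."""
    m = re.fullmatch(r"(\w+) (\d{1,2}), (\d{4})", expr.strip())
    if not m or m.group(1) not in MONTHS:
        return None
    month, day, year = MONTHS[m.group(1)], int(m.group(2)), int(m.group(3))
    return f"{year:04d}-{month:02d}-{day:02d}"


def anchor_last_week(doc_date):
    """Anchor 'last week' to an ISO week (e.g. 2007-W26) relative to the document date."""
    year, week, _ = (doc_date - timedelta(weeks=1)).isocalendar()
    return f"{year}-W{week:02d}"


print(normalise_date("July 2, 2007"))      # 2007-07-02
print(anchor_last_week(date(2007, 7, 2)))  # 2007-W26
```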

Conclusion

  • Information Extraction is a vast field with many different tasks and applications
    • Named Entity Recognition
    • Relation Extraction
    • Event Extraction
  • Machine learning methods involve classifiers and sequence labelling models.