python – Extract text from pdf ignoring cropped content

I’m trying to extract text from a PDF file that has been cropped, i.e. it has a defined crop box, so only a portion of each page is displayed.

The problem is that the cropped part still exists in the PDF file; it’s just not visible.

I’ve tried PyPDF2, pdfquery and pdfminer. They all read the entire content including the cropped portion.

PyPDF2 lets me access the dimensions of the cropbox using:

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
cropBox = pdfReader.getPage(0).cropBox

But I’m not sure what I can do with it. The files are being cropped in Java using Apache PDFBox. I’d prefer to read only the uncropped part of the files in Python, but I can also change the Java code that crops the files if that’s the only solution.

Any help is appreciated.
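
One possible direction, sketched under assumptions rather than tested against these files: newer releases of pypdf (the successor to PyPDF2) accept a visitor_text callback in extract_text, which receives each text fragment together with its text matrix. The helper names inside_box and extract_visible_text are hypothetical, and the sketch assumes that tm[4] and tm[5] give the fragment's position in page coordinates, which only holds when no additional transformation is in effect.

```python
def inside_box(x, y, box):
    """Return True if point (x, y) lies inside box = (left, bottom, right, top)."""
    left, bottom, right, top = box
    return left <= x <= right and bottom <= y <= top


def extract_visible_text(path, page_index=0):
    """Collect only the text fragments positioned inside the page's crop box."""
    # Third-party dependency; imported here so the helper above works without it.
    from pypdf import PdfReader

    page = PdfReader(path).pages[page_index]
    crop = page.cropbox
    box = (float(crop.left), float(crop.bottom), float(crop.right), float(crop.top))

    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        # Assumption: tm[4], tm[5] are the x, y position of this fragment.
        if inside_box(tm[4], tm[5], box):
            parts.append(text)

    page.extract_text(visitor_text=visitor)
    return "".join(parts)
```

Calling extract_visible_text("cropped.pdf") (a placeholder file name) would then return only the fragments whose origin falls inside the crop box. Fragments that start inside the box but extend past its edge would still be kept in full.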

Source link

java – Extract file extension from String using regex

I have the following String:


And I’m extracting the file extension (in this case "mp3") out of it.

The String varies according to the file type. Some examples:


Here’s how I’ve done it:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    private static final String BASE64_HEADER_EXP = "^data:.+;base64,";

    private static final Pattern PATTERN_BASE64_HEADER = Pattern.compile(BASE64_HEADER_EXP);

    private String data;

    private String fileName;

    public String getFileName() {
        Matcher base64HeaderMatcher = PATTERN_BASE64_HEADER.matcher(data);
        return String.format("%s.%s", getFilenameWithoutExtension(), getExtension(base64HeaderMatcher));
    }

    private String getFilenameWithoutExtension() {
        return fileName.split("\\.")[0];
    }

    private String getExtension(Matcher base64HeaderMatcher) {
        if (base64HeaderMatcher.find()) {
            // Take the part of the header between "/" and ";".
            String base64Header = base64HeaderMatcher.group(0);
            return base64Header.split("/")[1].split(";")[0];
        }
        // No base64 header: fall back to the extension in the file name.
        return fileName.split("\\.")[1];
    }
}


What I want is a way to do this without having to split and access array positions as I’m doing above, perhaps by extracting the extension with a regular expression.

I’m able to do it on RegExr site using this expression:


But when I try to use the same regex in Java, I get the error "Require that the characters immediately before the position do" because, apparently, Java doesn’t support repetition inside lookbehind.

vim – Extract JavaScript variable (feature example): how to improve it?

I am a Vim newbie with one week experience and I’m already enjoying it.

I’ve successfully written an (ugly) command+function+mapping with which I can extract some code to a new variable in JavaScript. This is the first version; it works with motions (jsexviw) and with a selection in visual mode:

command! -range -nargs=1 JsExVar normal `<v`>d^[i<args>^[Ovar <args> = (^[pa);^[
function! FnJsExVar(type)
    silent exec 'JsExVar '.input("Variable name: ")
endfunction
vnoremap <silent> <expr> <Leader>jsexv ":JsExVar ".input("Variable name: ")."<cr>"
nnoremap <silent> <Leader>jsexv :set opfunc=FnJsExVar<CR>g@

Then I refactored it a bit to avoid the duplicated input("Variable name: "), but now I have a weird (0) parameter when calling the function from the command:

command! -range JsExVar call ExecJsExVar(0) 
vnoremap <silent> <expr> <Leader>jsexv ":JsExVar<cr>"
nnoremap <silent> <Leader>jsexv :set opfunc=ExecJsExVar<CR>g@
function! ExecJsExVar(type)
    let varname = input("Variable name: ")
    silent exec "normal `<v`>di".varname."^["
    silent exec "normal Ovar ".varname." = ^["
    silent exec "normal pa;^["
endfunction

I still don’t fully understand the different ways of executing things, so I assume the code can be improved and cleaned up a lot. Thanks in advance for any corrections and suggestions.

java – Extract from HTML with JSoup

I’m new to Jsoup and I’m trying to scrape some data from a website with it.
I want to extract only the data in rows that have specific data-id attributes.
This is the structure of the web page:

<tr data-id="13">
  <td class="th">Dimension</td>
  <td class="l">152.5x82x9.8mm (6x3.23x0.39")</td>
</tr>
<tr class="even" data-id="15">
  <td class="th">Weight</td>
  <td class="l">190gr (6.7oz)</td>
</tr>
<tr class="h" data-id="116">
  <td class="th">Ringtone</td>
  <td class="l"></td>
</tr>

I need to get something like this:

  1. Dimension
  2. 190gr
  3. Ringtone

Please help me.

PS: I’m using Java.

postgresql – how to extract jsonB array elements with using GIN index

I am using PostgreSQL 9.4.

Create table temp_JsonB (ID serial , name_it jsonb);

insert into temp_JsonB (name_it) values ('[{"a":"foo"},{"b":"bar"},{"c":"baz"}]');
insert into temp_JsonB (name_it) values ('[{"c":"foo"},{"d":"hee"},{"c":"baz"}]');
insert into temp_JsonB (name_it) values ('[{"g":"ggo"},{"b":"bar"},{"c":"raz"}]');

CREATE INDEX temp_jsonb_gin
  ON temp_JsonB
  USING gin
  (name_it jsonb_path_ops);

select ID, jsonb_agg(obj)
from temp_JsonB , jsonb_array_elements(name_it) obj
where obj ->>'b' ='bar'
group by 1;

I need to get only one particular element of the array. The query above does that, but unfortunately it does not make use of the index. Is there any way to rewrite the query so that it uses the index?

The production table has 2000+ rows and the query takes time, so an index could be helpful here.

Java Streams extract strings starting with "EXCHANGEID="

I have the following file:




The file has many lines (I’ve only included a small section).
I want to extract each string starting with "EXCHANGEID=" using Java Streams and print it to the console.

So I want my output to be like:


Python How to extract specific string into multiple variable

I am trying to extract specific lines of a file into variables.

This is the content of my test.txt:

#first set
Task Identification Number: 210CT1
Task title: Assignment 1
Weight: 25
fullMark: 100
Description: Program and design and complexity running time.

#second set
Task Identification Number: 210CT2
Task title: Assignment 2
Weight: 25
fullMark: 100
Description: Shortest Path Algorithm

#third set
Task Identification Number: 210CT3
Task title: Final Examination
Weight: 50
fullMark: 100
Description: Close Book Examination

This is my code:

with open(home + r'\Desktop\PADS Assignment\test.txt', 'r') as mod:
    for line in mod:
        taskNumber , taskTile , weight, fullMark , desc = line.strip(' ').split(": ") 

Here is what I’m trying to get:

taskNumber is 210CT1 
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time

and then loop until the third set.

But an error occurs when I run it:

ValueError: not enough values to unpack (expected 5, got 2)
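
The error arises because each line of the file contains only one ": " separator, so split(": ") returns two values per line rather than five. A minimal sketch of one way around this, assuming the sets are separated by blank lines or "#" header lines as in the file above: collect the "key: value" lines of each set into a dictionary, then read the five fields from it. The function name parse_tasks and the shortened sample text are illustrative only.

```python
def parse_tasks(text):
    """Group 'key: value' lines into one dict per set of lines."""
    tasks, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            # A blank line or a '#...' header closes the current set, if any.
            if current:
                tasks.append(current)
                current = {}
            continue
        # Split on the first ': ' only, so values may contain colons.
        key, _, value = line.partition(": ")
        current[key] = value
    if current:
        tasks.append(current)
    return tasks


sample = """#first set
Task Identification Number: 210CT1
Task title: Assignment 1
Weight: 25
fullMark: 100
Description: Program and design and complexity running time.

#second set
Task Identification Number: 210CT2
Task title: Assignment 2
Weight: 25
fullMark: 100
Description: Shortest Path Algorithm
"""

for task in parse_tasks(sample):
    task_number = task["Task Identification Number"]
    task_title = task["Task title"]
    weight = task["Weight"]
    full_mark = task["fullMark"]
    desc = task["Description"]
    print(task_number, task_title, weight, full_mark, desc)
```

To use it on the real file, the text passed to parse_tasks would come from mod.read() inside the with block. The values stay strings, so weight and full_mark would need int() if they are used as numbers.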

Matlab Matrix – how to extract columns from a matrix filter columns

I am a complete beginner in MATLAB and I am stuck at this point of my project.

I have an n×n matrix and I want to extract its columns in groups of three, keeping each group in a separate variable. I would like to do this at a larger scale because I am working with a large amount of data. Is it possible to automate it?

I know that it is possible to do it this way:
A1 = A(:,1:3); A2 = A(:,4:6); A3 = A(:,7:9)
but I would like to simplify it for managing large amounts of data.

Matrix 9 x 9
A =

 1     2     3     4     5     6     7     8     9
 2     4     6     8    10    12    14    16    18
 3     6     9    12    15    18    21    24    27
 4     8    12    16    20    24    28    32    36
 5    10    15    20    25    30    35    40    45
 6    12    18    24    30    36    42    48    54
 7    14    21    28    35    42    49    56    63
 8    16    24    32    40    48    56    64    72
 9    18    27    36    45    54    63    72    81

Expected result:

A1 =

 1     2     3
 2     4     6
 3     6     9
 4     8    12
 5    10    15
 6    12    18
 7    14    21
 8    16    24
 9    18    27

A2 =

 4     5     6
 8    10    12
12    15    18
16    20    24
20    25    30
24    30    36
28    35    42
32    40    48
36    45    54

A3 =

 7     8     9
14    16    18
21    24    27
28    32    36
35    40    45
42    48    54
49    56    63
56    64    72
63    72    81

Thank you very much in advance!!!!

extract a linguistic structure based on POS tagged sentence using Stanford nlp in JAVA

I am new to NLP. I want to do POS tagging and then find a specific structure within a text. I managed the POS tagging with Stanford NLP, but I cannot extract the structure: NN/NNS + IN + DET + NN/NNS/NNP/NNPS

public static void main(String args[]) throws Exception{
    //input File
    String contentFilePath = "";
    String triplesFilePath = contentFilePath.substring(0, contentFilePath.length()-4)+"_postagg.txt";

    //document to POS tagging
    String content = getFileContent(contentFilePath);

    Properties props = new Properties();

    props.setProperty("annotators","tokenize, ssplit, pos");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // Create the document and run the pipeline on it.
    Annotation doc = new Annotation(content);
    pipeline.annotate(doc);

    // Print every token together with its POS tag.
    List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            String word = token.get(CoreAnnotations.TextAnnotation.class);
            // this is the POS tag of the token
            String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
            System.out.println(word + "/" + pos);
        }
    }
}
